- \"Macro for putting levels 1 through 4 section headings in t.o.c.
- .de $0
- .if \\$3=1 \{\
- .(x
- \fB\\$2 \\$1\fR
- .)x
- \}
- .if \\$3=2 \{\
- .(x
- \fB\\$2\fR \\$1
- .)x
- \}
- .if \\$3=3 \{\
- .(x
- \fB\\$2\fR \\$1
- .)x
- \}
- .if \\$3=4 \{\
- .(x
- \fB\\$2\fR \\$1
- .)x
- \}
- ..
- \" end macro
- .\" use larger type so that it looks OK after photo-reducing
- .nr pp 12\" use larger point size
- .nr sp 12\" yep, I really mean it
- .nr tp 12\" and I'll mean it after other stuff
- .nr fp 10\" don't reset to 10 point (and use 9 footnotes)
- .sz 12\" believe me!!!
- .\"
- .\" USING THE EXODUS STORAGE MANAGER
- .\"
- .\"Macro for the Storage Manager version number
- .ds V "3.1
- .po 1.0i
- .ll 6.5i
- .fo ''%''
- .tp
- .sp 20
- .ls 1
- .ce 5
- .sz 14
- \fBUsing the EXODUS Storage Manager V\*V\fR
- .sz 12
- (Last revision: November, 1993)
- .sp 3
- .(f
- The EXODUS software was developed primarily with funds provided
- by the Defense Advanced Research Projects Agency under contracts
- N00014-85-K-0788, N00014-88-K-0303, and DAABO7-92-C-Q508
- and monitored by the US Army Research Laboratory.
- Additional support was provided by Texas Instruments, Digital Equipment
- Corporation, and Apple Computer.
- .)f
- .bp
- .sh 1 "INTRODUCTION"
- .lp
- The EXODUS Storage Manager
- is a multi-user object storage system
- supporting versions, indexes, single-site transactions,
- distributed transactions, concurrency control, and
- recovery.
- This document provides information about using version \*V of
- the EXODUS Storage Manager.
- Information about installing the Storage Manager can be found in
- the \fIEXODUS Storage Manager Installation Manual\fR.
- Section 2 gives an overview of the system.
- Section 3 discusses
- configuration facilities.
- Section 4 describes, in detail, the
- Storage Manager's application interface.
- Section 5 describes
- how to use the Storage Manager server.
- Appendices provide more details on certain aspects of the system.
- A table of contents is located at the end of the document.
- .br
- .sh 1 "OVERVIEW OF THE EXODUS STORAGE MANAGER"
- .lp
- This section, an executive summary,
- briefly describes the architecture of the
- Storage Manager and gives
- an overview of the facilities provided
- to applications.
- .lp
- Version \*V of the Storage Manager runs on the following architectures:
- Sun 4 (Sparc) (under SunOS 4.1.[23]),
- DecStation 3100/5000 (MIPS) (under Ultrix 4.2),
- and
- HP 720 (under HP-UX A.08.07).
- The Storage Manager is written in C++ and has been checked for
- compilation under the GNU C++ compiler (g++), versions 2.3.3 and 2.4.5.
- .sh 2 "Architecture"
- .lp
- The EXODUS Storage Manager has a client-server architecture.
- An application program that uses the Storage Manager
- may reside on a machine different from the machine or machines
- on which the Storage Manager server or servers run.
- .(x z
- application
- .)x \*($n
- We use the term \fIapplication\fR to refer to programs
- that use the Storage Manager through the client programming
- interface described in
- Section 4.
- We use the term \fIclient library\fR, or
- .(x z
- client, client library
- .)x \*($n
- \fIclient\fR, to refer to the Storage Manager code and data structures
- that are linked into the application program to support the client
- programming
- interface.
- The client allows applications to use the facilities
- described in the next sub-section.
- Each client has its own buffer pool for caching data.
- The client library connects to
- one or more server processes
- and communicates with them using a remote-procedure-call-style
- mechanism that runs over TCP.
- .lp
- The Storage Manager server is a multi-threaded process providing
- asynchronous I/O, file, transaction, concurrency control, and recovery
- services to multiple clients.
- The server stores all data on \fIvolumes\fR,
- which are either Unix files or raw disk partitions.
- The server is more completely described in
- Section 5
- and in the \fIEXODUS Storage Manager Architecture Overview\fR [exoArch].
- .br
- .sh 2 "Facilities"
- .lp
- The EXODUS Storage Manager provides \fIobjects\fR for storing
- data, \fIversions\fR of objects, \fIfiles\fR for grouping related
- objects, and \fIindexes\fR for supporting efficient object access.
- The Storage Manager also provides \fIvolumes\fR,
- \fItransactions\fR, \fIconcurrency control\fR, \fIrecovery\fR, and
- \fIconfiguration options\fR. These facilities are presented briefly in
- this section, and more information can be found in later sections of
- the document.
- .sh 3 "Objects"
- .lp
- An object is an uninterpreted container of bytes,
- which can range in size from a few bytes to
- a little less than the size of a disk.
- Internally, the Storage Manager distinguishes two types of objects.
- There are \fIsmall objects\fR, which are objects that
- fit on a single disk page, and \fIlarge objects\fR, which are objects
- that do not fit on a single disk page.
- Support is also provided for creating and manipulating versions of
- both small and large objects.
- To provide a uniform function call interface, the distinction between
- small, large, and versioned objects is hidden from applications.
- Applications are unaware of whether they are dealing with
- a small or large object, and the same interface functions are
- called to manipulate either type of object.
- To simplify the task of manipulating very large objects,
- the Storage Manager provides flexible buffer management that
- allows variable-length pieces of large objects to be
- buffered contiguously in the client buffer pool.
- .lp
- Objects have object identifiers.
- The object identifier
- of a small object points directly to the object on disk, while the
- object identifier
- of a large object points to a \fIlarge object header\fR.
- The
- header of a large object serves as the root of a
- B\*[+\*]tree
- .(x z
- B\*[+\*]tree index
- .)x \*($n
- .(x z
- index, B\*[+\*]tree
- .)x \*($n
- index structure that is used to access the object's data
- [Care86, Care89].
- For space efficiency, a large object header can
- share a disk page with small objects and other large object headers.
- The data pages and the pages that make up the index structure of a
- large object are not shared, however. When a small object grows to the
- point where it can no longer be stored on a single page, the Storage
- Manager automatically converts it to a large object, leaving the new
- header in place of the original object.
- .lp
- The Storage Manager provides functions to read, overwrite, insert,
- delete, and append to an object.
- Read requests specify an object identifier and a range of bytes.
- The desired data is read into a contiguous
- region in the client buffer pool (even if it is distributed over several
- disk pages), and a pointer to the data is returned to the caller.
- The overwrite function uses the pointer set up by a read request, and
- overwrites a subrange of the data.
- The insert and delete functions allow data to be inserted
- into and deleted from objects at arbitrary offsets, while the
- append function allows data to be appended to the end of an object.
- As mentioned earlier,
- large objects are represented using a
- B\*[+\*]tree
- index structure.
- This ensures that each of the above operations can be
- executed efficiently on large objects.
- .sh 3 "Versions"
- .lp
- A version of an object is another object that appears
- to be a copy of the original object.
- A version of a small object is a copy of the original object.
- A version of a large object is an object header with a pointer into
- the original object's data, until either the version or the original
- object is updated.
- When the large object version is updated, the affected portions of
- the original object are copied
- to prevent the original object from being affected by the
- update [Care89].
- Although the version support described here is
- primitive, essentially providing \*(lqcopy-on-write\*(rq objects, it has been
- purposefully designed that way so that a variety of
- application-specific versioning schemes can be implemented on top of
- the Storage Manager.
- .sh 3 "Files"
- .lp
- Objects are allocated in \fIfiles\fR, which are
- collections of related objects. Files have three uses.
- .lp
- First, files are used for clustering objects.
- The objects in a file are stored on disk pages allocated solely to that file,
- so files provide a way to physically co-locate related objects on the disk.
- .lp
- Second, the Storage Manager provides an efficient way to \fIscan\fR
- the objects in a file, visiting each object exactly once.
- .lp
- Third, the Storage Manager offers an efficient mechanism for
- loading the objects into a file in bulk.
- .sh 3 "Indexes"
- .lp
- The Storage Manager provides
- B\*[+\*]tree
- indexes and linear hashing indexes.
- .(x z
- index, linear hashed
- .)x \*($n
- .(x z
- linear hashed index
- .)x \*($n
- .(x z
- B\*[+\*]tree index
- .)x \*($n
- .(x z
- index, B\*[+\*]tree
- .)x \*($n
- Index keys can be any basic C language data type or strings.
- Values can be any type of fixed length.
- .sh 3 "Volumes"
- .lp
- User data and Storage Manager
- meta-data (objects, files, indexes, and logs)
- are stored on volumes.
- A volume represents a disk, although in fact it may
- be a Unix raw disk partition or a Unix file.
- .lp
- Volumes can be \fItemporary\fR, which means
- that data stored on them are not logged,
- and they do not persist from one transaction to the
- next.
- Temporary volumes are meant to provide fast
- storage for temporary data.
- .sh 3 "Transactions"
- .lp
- A transaction is a set of operations on objects, files, and indexes.
- Transactions are either committed or aborted.
- Updates made by committed transactions are guaranteed to be
- reflected on stable storage, even in the event of software or
- processor failure.
- Updates made by aborted transactions are not reflected on stable storage.
- .lp
- Transactions that use data on more than one server are
- committed using a distributed two-phase commit protocol [Moha83].
- .(x z
- two-phase commit protocol
- .)x \*($n
- .(x z
- transactions
- .)x \*($n
- .(x z
- transactions, distributed
- .)x \*($n
- .(x z
- distributed transactions
- .)x \*($n
- .sh 3 "Concurrency Control"
- .lp
- Concurrency control allows multiple client applications
- to use data simultaneously and safely.
- Concurrency control is based on the standard hierarchical two-phase
- locking protocol providing degree-three consistency (see
- [Gray78, Gray88]).
- The lock hierarchy contains two granularities:
- file-level and page-level.
- Locking for index operations is performed with a non-two-phase protocol,
- which allows multiple clients to read and update the same index.
- .lp
- Deadlocks involving more than one server are resolved through timeouts.
- .sh 3 "Recovery"
- .lp
- The Storage Manager recovers from software, operating system, and
- CPU failure by restoring data to a state
- in which all transactions have been committed or aborted.
- After an application fails, the
- transaction it is running is aborted by the servers
- that cooperated in the transaction.
- After a server fails and is restarted,
- updates made by committed transactions are restored,
- and updates by transactions in progress at the time of failure are undone.
- Recovery from media (disk) failure is not supported.
- .sh 3 "Configuration Options"
- .lp
- The Storage Manager client library and servers have
- \fIconfiguration options\fR, which can be set by users.
- These options control such things as
- parameters that affect performance and memory use,
- formats of volumes and logs,
- the choice of servers to be contacted by clients,
- and path names of installed executable files.
- .br
- .sh 2 "Illustration of Using the Storage Manager"
- .lp
- The purpose of this section is to give the reader
- a context in which to read the rest of this document.
- This section illustrates a way to get started using the Storage Manager.
- There are many ways to install, configure, and use the
- Storage Manager;
- only the simplest way is illustrated here.
- .lp
- This section uses an example application, \*(lqproducer-consumer\*(rq.
- The source code for the application programs is
- included in the Storage Manager software release, along with other
- example applications.
- .lp
- The producer program generates a series of transactions, each of
- which creates an object.
- The consumer program generates a series of transactions,
- each of which reads an object and destroys it.
- These programs were selected because they are relatively small, demonstrate
- the use of transactions, and show how to respond to
- server-initiated transaction failures and
- server failures.
- .lp
- The remainder of this section gives specific directions
- for starting a server and running the example
- program.
- Detailed explanations of the steps are not given here;
- all the details are given elsewhere in this document.
- .lp
- Installing the Storage Manager is akin to
- installing an operating system or a remote
- file system (but it's much simpler).
- You need to:
- .np
- install the system's executable code, libraries,
- and include files;
- .np
- prepare your disks for use;
- .np
- configure your server so that it will
- use your disks, and so that it is otherwise tailored
- for your use;
- .np
- compile and link your application programs to use
- the installed system;
- .np
- configure your application programs' environment
- and run the programs; and
- .np
- when you are finished, shut the system down.
- .br
- .sh 3 "Files Needed"
- .lp
- The following files are needed to use the Storage Manager:
- .np
- \fClibsm_client.a\fR, the Storage Manager client library,
- .np
- \fCsm_client.h\fR,
- the include file containing declarations of key data
- structures and constants,
- .np
- \fCsm_server\fR,
- the executable file for the server portion of the Storage Manager,
- .np
- \fCdiskrw\fR,
- the executable file for the disk I/O processes used by the server process,
- .np
- \fCformatvol\fR, a utility program for formatting volumes, and
- .np
- \&\fC.sm_config\fR, configuration files for a server, the formatter, and the
- application programs.
- One configuration file can be used for all programs, but it is
- sometimes easier to use one configuration file for
- servers and the formatter, and
- another for applications.
- .lp
- These files can be installed anywhere; for the purpose of this
- section, we assume that they are all installed in your
- home directory, along with your application programs.
- (See the \fIEXODUS Storage Manager Installation Manual\fR
- to find the files in the Storage Manager software release.)
- .br
- .sh 3 "Preparing Your Disks"
- .lp
- The producer and consumer programs
- store their objects on a data volume
- managed by a single server, and
- the server uses a log volume.
- The \fCformatvol\fR program is used to
- format a volume for use as either a data volume or a log volume.
- If you plan to use a raw disk partition for either volume,
- ask your system administrator for information
- on how to set up the device.
- .lp
- The formats of the volumes must be
- described in the configuration file that
- \fCformatvol\fR reads.
- In the directory in which you plan to run \fCformatvol\fR,
- create a file called \fC.sm_config\fR that looks something like this,
- with the appropriate substitutions:
- .(b
- .nf
- \fC
- formatvol*logformat: /path/to/logfile: 9000: 1: 1: 1000: 8
- formatvol*dataformat: /path/to/datafile: 8000: 1: 1: 300
- \fR
- .)b
- .lp
- Substitute the pathnames for files that you want to use
- for your log volume and data volume.
- With the options given above,
- the log volume
- will be given a volume identifier of 9000,
- and will consist of 1 cylinder containing 1 track, with 1000 blocks
- on the track; hence, the log volume will contain 1000 blocks.
- The log volume will use 8 Kbyte log pages.
- The data volume will be given a volume identifier of 8000,
- and will consist of 1 cylinder containing 1 track, with 300 blocks
- on the track; hence, the data volume will contain 300 blocks.
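- .lp
- The arithmetic behind these format values can be sketched in C.
- This is an illustration only and is not part of the Storage Manager:
- the structure \fCvol_format\fR and the functions \fCparse_format\fR and
- \fCtotal_blocks\fR are hypothetical names, and the field order
- (path, volume identifier, cylinders, tracks per cylinder,
- blocks per track, optional log page size in Kbytes)
- is an assumption inferred from the description above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical holder for one logformat/dataformat value. */
struct vol_format {
    char path[256];
    int  volid;           /* volume identifier */
    int  cylinders;
    int  tracks_per_cyl;
    int  blocks_per_trk;
    int  log_page_kb;     /* 0 when the field is absent (data volumes) */
};

/* Parse "path: volid: cyl: trk: blk[: logpage]".
 * Returns 0 on success, -1 if fewer than five fields are present. */
int parse_format(const char *value, struct vol_format *f)
{
    int n;

    assert(value != NULL && f != NULL);
    f->log_page_kb = 0;
    n = sscanf(value, " %255[^:]: %d: %d: %d: %d: %d",
               f->path, &f->volid, &f->cylinders,
               &f->tracks_per_cyl, &f->blocks_per_trk, &f->log_page_kb);
    return (n >= 5) ? 0 : -1;
}

/* Total blocks on the volume: cylinders x tracks x blocks per track. */
long total_blocks(const struct vol_format *f)
{
    assert(f != NULL);
    return (long)f->cylinders * f->tracks_per_cyl * f->blocks_per_trk;
}
```

- .lp
- Applied to the log format value shown above, \fCtotal_blocks\fR yields 1000;
- applied to the data format value, it yields 300.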
- .lp
- Now, run the formatter on volumes 9000 and 8000:
- .(b
- \fCformatvol -vol 9000 -vol 8000\fR
- .)b
- .lp
- If you would like to see the information written on the
- volumes' headers, do this:
- .(b
- \fCformatvol -dis 9000 -dis 8000\fR
- .)b
- .lp
- The formatter prints:
- .(b M
- \fCVOLID 9000, version 3, is a LOG volume
- BLOCK SIZES: 8 K slotted, 8 K lg data, 8 K lg hdr
- 8 K btree, 8 K idesc
- LAYOUT: 1000 blk/trk; 1 trk/cyl; 1 cyl
- 1000 total blocks of 8 KB for 8192.000 KB
- FREE: 0 free, 1000 used
- BITMAP: 1 blk each, freemap @ 2, slotmap @ 4, filemap @ 5
- UNIQUE: start @ 3
- LOG: start @ 7, ctl blk @ 6, blk sz 8 K, #blks 993
- end of log @ dismount: LSN w=0.o=0, LRC w=0.c=1
- VOLID 8000, version 3, is a DATA volume
- BLOCK SIZES: 8 K slotted, 8 K lg data, 8 K lg hdr
- 8 K btree, 8 K idesc
- LAYOUT: 300 blk/trk; 1 trk/cyl; 1 cyl
- 300 total blocks of 8 KB for 2457.600 KB
- FREE: 294 free, 6 used
- BITMAP: 1 blk each, freemap @ 2, slotmap @ 4, filemap @ 5
- UNIQUE: start @ 3
- \fR
- .)b
- .lp
- Now that you have formatted a log volume and a data volume,
- you are ready to start a server.
- .br
- .sh 3 "Configuring a Server"
- .lp
- Before you start a server, you need to create its configuration
- file.
- In the directory in which you expect to run the server,
- create a file called \fC.sm_config\fR that looks something like this,
- with the appropriate substitutions (in particular,
- for each occurrence of \fC/path/to\fR below):
- .(b
- .nf
- \fC
- server*bufpages: 500
- # Portname need not be identical to log volume id.
- # This is just a convenience.
- server*portname: 9000
- server*diskproc: /path/to/diskrw
- server*logformat: /path/to/logfile: 9000: 1: 1: 1000: 8
- server*dataformat: /path/to/datafile: 8000: 1: 1: 500
- server*logvolume: 9000
- \fR
- .)b
- .lp
- If the same configuration file is to be used for the formatter
- and the server, the format options can be made to be recognized
- by both:
- .(b
- \fC
- [sf]*[rl].logformat: /path/to/logfile: 9000: 1: 1: 1000: 8
- [sf]*[rl].dataformat: /path/to/datafile: 8000: 1: 1: 500
- \fR
- .)b
- .lp
- Now you can start the server.
- Open a window in which to run the server, and, in the
- directory containing the server and its configuration
- file, start the server:
- .(b
- \fCsm_server\fR
- .)b
- The server is started on a newly formatted log volume, so
- it automatically regenerates the log.
- The server prints
- .(b
- \fCServer is ready for requests.\fR
- .)b
- when it can serve applications.
- .br
- .sh 3 "Compiling and Linking Your Application"
- .lp
- An application program must include the header file
- \fCsm_client.h\fR, which, in turn, includes
- \fC<stdio.h>\fR,
- \fC<setjmp.h>\fR,
- \fC<sys/types.h>\fR, and
- \fC<netinet/in.h>\fR.
- Applications can be compiled with a C or C++ compiler.
- .lp
- The client library is compiled with C++,
- so client programs must be linked with a C++ compiler.
- See the \fIEXODUS Storage Manager Installation Manual\fR for
- more information.
- .br
- .sh 3 "Configuring and Running Your Application"
- .lp
- The programs need configuration options to determine where
- to find the server that manages the data volumes they use,
- and to determine the sizes of the buffer pools they will use.
- In the directory in which you expect to run the application programs,
- create a file called \fC.sm_config\fR that looks something like this,
- with the appropriate substitutions:
- .(b
- .nf
- \fC
- # both producer and consumer will use
- # 250 page buffer pools:
- client*bufpages: 250
- # substitute the name or Internet address
- # of the host on which the server runs:
- client*mount: 8000 9000@serverhost
- \fR
- .)b
- Now you can run the producer and the consumer.
- It is easiest to create a window in which to run each
- program.
- The producer and consumer programs use the environment
- variable EVOLID to determine which volume to use.
- EVOLID must be set in each window.
- .lp
- In window P:
- .(b
- \fC# producer <name> <#objects> <object size>
- setenv EVOLID 8000
- producer P 100 1000\fR
- .)b
- In window C:
- .(b
- .sp 1
- \fC# consumer <name> <#objects>
- setenv EVOLID 8000
- consumer C 100\fR
- .)b
- .lp
- The producer creates
- \*(lq#objects\*(rq objects and
- writes \*(lqname\*(rq in each one.
- The \*(lqobject size\*(rq argument is the size of each object.
- The consumer reads and destroys
- \*(lq#objects\*(rq objects.
- It prints the sizes of the objects and their names.
- The \*(lqname\*(rq given to the consumer program is
- immaterial, but is helpful for
- reading the output
- when running more than one consumer.
- .lp
- The two programs use a single root entry and a single file on the
- given volume.
- When a consumer
- has consumed the last object in a file, it destroys the file and
- removes the root entry.
- Each object is produced or consumed in a separate transaction.
- When both a producer and
- consumer are running concurrently,
- deadlocks occur periodically,
- since both are reading and writing the same file.
- When a deadlock occurs,
- the offending program aborts its transaction and
- tries again.
- Multiple producer and consumer programs may be started.
- If the server fails or shuts down, the producer and consumer
- programs attempt to reconnect every five seconds, and when
- successful, they continue transaction processing.
- .br
- .sh 3 "Shutting Down the Server"
- .lp
- In the window in which the server runs,
- type the command:
- .(b
- \fCshutdown\fR
- .)b
- .lp
- The server prints various messages, among them
- .(b
- \fC
- Clean shutdown: no recovery required on any volumes.
- All disk processes killed.
- \fR
- .)b
- when recovery is not required.
- .bp
- .sh 1 "CONFIGURATION OPTIONS AND CONFIGURATION FILES"
- .lp
- The client library, servers, and administrative
- programs use configuration options.
- All the options have a string name, a type,
- a set of possible values, a default value, and a current value.
- Client options can be set by a call to an application
- interface function or
- by a line in a \fIconfiguration file\fR.
- .(x z
- configuration options
- .)x \*($n
- .(x z
- configuration file
- .)x \*($n
- Server options can be set on the command line or by a line in the
- server's configuration file.
- .lp
- Configuration files are Unix files, and are similar in format
- to the X Window system's resource files.
- Each line in a configuration file is an
- \fIoption command\fR or a
- \fIcomment\fR.
- .lp
- A comment is a line that begins with \*(lq#\*(rq or with \*(lq!\*(rq.
- .lp
- An option command is a line containing an \fIoption
- descriptor\fR, white space, and a string representing a
- value to assign to the option.
- An option descriptor consists of
- an \fIoption prefix\fR
- followed immediately by an option name and a \*(lq:\*(rq.
- .lp
- The option prefix specifies the type and name of the program
- or programs for which the option is to be set.
- The program type is one of
- \*(lqclient\*(rq, \*(lqserver\*(rq, and \*(lqformatvol\*(rq.
- The program name is usually the file
- name of the program, without its path
- (an application program can override this).
- The program type and program name are separated by \*(lq.\*(rq.
- For example, the complete option descriptor for the option
- \*(lqbufpages\*(rq on the server named \fCserverA\fR is
- \fCserver.serverA.bufpages:\fR.
- .lp
- Wild card characters are allowed in
- the program type and name.
- The character \*(lq*\*(rq represents any portion of the prefix.
- The \*(lq?\*(rq character represents any program type or any program name.
- The expressions describing the program type and the program name
- are parsed by a regular expression handler,
- so complex expressions can be used.
- See the manual page for regex(3).
- .lp
- The names of options can be abbreviated,
- as long as the abbreviation unambiguously identifies a single option.
- (This is also true for options appearing on command lines.)
- Program types and names may not be abbreviated.
- Option name, program type, and program name matches are case-sensitive.
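- .lp
- The abbreviation rule can be sketched in C as follows.
- This is only an illustration, not the Storage Manager's own code:
- \fCresolve_option\fR is a hypothetical helper, and the name list is
- copied from the client options in Table 1.
- An abbreviation is accepted only when exactly one option name begins with it,
- and the comparison is case-sensitive.

```c
#include <assert.h>
#include <string.h>

/* Option names taken from the client option table in this manual. */
static const char *const option_names[] = {
    "bufpages", "groups", "userdesc", "mount", "lognewpages",
    "deallocpages", "pagelock", "traceflags", "locktimeout"
};
#define NUM_OPTIONS (sizeof(option_names) / sizeof(option_names[0]))

/* Hypothetical helper: resolve a possibly abbreviated option name.
 * Returns the full name, or NULL if the abbreviation matches no
 * option or matches more than one.  Matching is case-sensitive. */
const char *resolve_option(const char *abbrev)
{
    const char *found = NULL;
    int matches = 0;
    size_t len, i;

    assert(abbrev != NULL);
    len = strlen(abbrev);
    if (len == 0)
        return NULL;
    for (i = 0; i < NUM_OPTIONS; i++) {
        if (strncmp(option_names[i], abbrev, len) != 0)
            continue;                 /* not a prefix of this name */
        if (option_names[i][len] == '\0')
            return option_names[i];   /* a full name always wins */
        found = option_names[i];
        matches++;
    }
    return (matches == 1) ? found : NULL;
}
```

- .lp
- With this list, \*(lqbuf\*(rq resolves to \*(lqbufpages\*(rq and \*(lqloc\*(rq to \*(lqlocktimeout\*(rq,
- while \*(lqlo\*(rq is rejected because it is a prefix of both
- \*(lqlognewpages\*(rq and \*(lqlocktimeout\*(rq.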
- .lp
- Configuration options of type Boolean
- can be set with the
- Boolean values TRUE or FALSE,
- or with the strings \*(lqyes\*(rq, \*(lqtrue\*(rq, \*(lqno\*(rq or \*(lqfalse\*(rq.
- The strings may be abbreviated and are not case-sensitive.
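- .lp
- A minimal C sketch of how such Boolean strings might be recognized is shown here.
- The helpers \fCis_abbrev\fR and \fCparse_boolean\fR are hypothetical and are not
- taken from the Storage Manager; the sketch assumes that any non-empty
- leading portion of \*(lqyes\*(rq, \*(lqtrue\*(rq, \*(lqno\*(rq, or \*(lqfalse\*(rq is an acceptable abbreviation.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive test: is `s` a non-empty prefix of `word`? */
static int is_abbrev(const char *s, const char *word)
{
    if (*s == '\0')
        return 0;
    for (; *s != '\0'; s++, word++) {
        if (*word == '\0' ||
            tolower((unsigned char)*s) != tolower((unsigned char)*word))
            return 0;
    }
    return 1;
}

/* Hypothetical helper: returns 1 for a true value, 0 for a false
 * value, and -1 if the string is not a recognized Boolean. */
int parse_boolean(const char *s)
{
    assert(s != NULL);
    if (is_abbrev(s, "yes") || is_abbrev(s, "true"))
        return 1;
    if (is_abbrev(s, "no") || is_abbrev(s, "false"))
        return 0;
    return -1;
}
```

- .lp
- Because the four words begin with different letters, a single letter
- such as \*(lqy\*(rq or \*(lqF\*(rq is already unambiguous under this assumption.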
- .lp
- Each setting of an option overrides any
- previous value for that option.
- .lp
- Below, excerpts from configuration files illustrate
- ways to use the options.
- .(b I
- \fC
- # log volumes for two servers, whose executable
- # file names are serverA and serverB
- server.serverA.logvolume: 1000
- server.serverB.logvolume: 2000
- \fR
- .)b
- .(b I
- \fC
- # turn off progress printing for all servers
- server*progress: no
- # or
- server.?.progress: no
- \fR
- .)b
- .(b I
- \fC
- ! all servers and clients have a 1000 page buffer pool
- *bufpages: 1000
- # The application foo uses a 500 page buffer pool.
- # (overriding the value of 1000, above)
- client.foo.bufpages: 500
- # Applications beginning with the letter g use 400 pages
- client.g*.bufpages: 400
- \fR
- .)b
- .bp
- .sh 1 "THE STORAGE MANAGER APPLICATION INTERFACE"
- .lp
- The Storage Manager's application interface consists of a set of functions,
- macros, and variables.
- The Storage Manager software release contains the header file
- \fCsm_client.h\fR,
- in which are found the definitions for the macros and
- types that appear in this document.
- Function prototypes for the Storage
- Manager functions are also found in \fCsm_client.h\fR.
- By convention, words that appear capitalized in the text
- are either C-preprocessor macros or C- or C++-defined types.
- Function definitions appear in bold face in the text.
- The rest of this section is divided into sub-sections
- describing error handling,
- initialization and shutdown,
- transactions,
- buffer management,
- operations on objects,
- operations on versions,
- operations on files,
- operations on indexes,
- miscellaneous macros, and
- administrative functions.
- .br
- .sh 2 "Handling Errors"
- .lp
- Error handling is important to
- users wishing to write robust client applications.
- We discuss it first, since most Storage Manager functions return error codes.
- Although this issue is complex, some of the burden is lightened by
- the recovery facilities of the Storage Manager.
- In this section we focus on error codes and error messages.
- .lp
- Almost all Storage Manager functions have integer return codes.
- .(x z
- error return codes, sm_errno
- .)x \*($n
- .(x z
- sm_errno
- .)x \*($n
- All functions (except those used in printing error messages) return either
- esmNOERROR (zero), which represents success, or esmFAILURE (negative
- one), which represents an error.
- When an error occurs, the global variable
- sm_errno contains an error code.
- A small positive error code is an error code returned by Unix,
- as defined in \fC<errno.h>\fR.
- An error code greater than 65,536 is an error returned by
- the Storage Manager, as defined in \fCsm_client.h\fR.
- The Storage Manager error codes have symbolic
- names (C preprocessor macros) that
- begin with \fIesm\fR.
- \fBThe value of sm_errno is not defined when
- the function returns esmNOERROR.\fR
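- .lp
- The division of error codes into Unix codes and Storage Manager codes
- can be sketched in C as follows.
- This is an illustration only: \fCSM_ERROR_BASE\fR, \fCerror_source\fR,
- \fCclassify_error\fR, and \fCerror_source_name\fR are hypothetical names that
- do not appear in \fCsm_client.h\fR, and treating every positive code up to
- 65,536 as a Unix code is an assumption.
- The sketch applies only to a value read from sm_errno after a function has returned esmFAILURE.

```c
#include <assert.h>

/* Storage Manager codes are described as greater than 65,536.
 * SM_ERROR_BASE is a hypothetical name for that boundary. */
#define SM_ERROR_BASE 65536

enum error_source { ERR_UNKNOWN, ERR_UNIX, ERR_STORAGE_MANAGER };

/* Classify an error code read from sm_errno after a failure. */
enum error_source classify_error(int code)
{
    if (code > SM_ERROR_BASE)
        return ERR_STORAGE_MANAGER;  /* defined in sm_client.h */
    if (code > 0)
        return ERR_UNIX;             /* assumed to come from <errno.h> */
    return ERR_UNKNOWN;              /* zero or negative: not an error code */
}

/* Printable name for a classification, for use in diagnostics. */
const char *error_source_name(enum error_source src)
{
    switch (src) {
    case ERR_UNIX:            return "Unix";
    case ERR_STORAGE_MANAGER: return "Storage Manager";
    case ERR_UNKNOWN:         return "unknown";
    }
    assert(!"invalid error_source value");
    return "unknown";
}
```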
- .lp
- Information about error codes can be obtained from the
- functions sm_Error(\ ) and sm_ErrorId(\ ),
- which are discussed below.
- .lp
- Some errors cause a message to be printed to
- the file addressed by \fCsm_ErrorStream\fR.
- By default,
- .(x z
- default error file for messages
- .)x \*($n
- this file is the standard error file, stderr, as defined in
- \fC<stdio.h>\fR,
- but the application can change it any time after the Storage
- Manager is initialized.
- .lp
- Errors differ in severity and have different side effects.
- The most severe errors are fatal and cause the application
- to exit (the client library calls \fIexit(3)\fR).
- When the application exits, the servers abort the transaction,
- if a transaction is active.
- Fatal errors are caused by internal software problems in the Storage Manager.
- An example of a fatal error is esmMALLOCFAILED, which occurs when the entire
- data segment has been allocated by the application and client library,
- and the Storage Manager cannot proceed.
- .lp
- Less severe errors
- cause the transaction to be aborted, but leave the application running.
- When this happens, sm_errno is given the value esmTRANSABORTED,
- .(x z
- esmTRANSABORTED
- .)x \*($n
- .(x z
- transaction aborted
- .)x \*($n
- and the client library also sets the global variable \fCsm_reason\fR.
- .(x z
- sm_reason
- .)x \*($n
- .(x z
- error return codes, sm_reason
- .)x \*($n
- The range of values for \fCsm_reason\fR
- is the same as the range of values for sm_errno.
- (The value of \fCsm_reason\fR is meaningful only if
- sm_errno has the value esmTRANSABORTED, and it
- is unpredictable and meaningless otherwise.)
- When the server or the client library aborts a transaction and
- returns esmTRANSABORTED to the application, the transaction is
- only partially aborted.
- The application \fBmust\fR complete the termination of the transaction by
- calling sm_AbortTransaction(\ ) (described in
- Section 4.3.3, \fBTransaction Operations\fR).
- .lp
- Still less severe errors are generated by incorrect arguments to client
- interface functions or the lack of resources, such as buffer space.
- The application can correct the problem and retry the operation
- without aborting the transaction.
- .lp
- Finally, some error codes indicate conditions that are not errors
- at all, such as esmEMPTYFILE, which is returned when an empty file
- is read.
- .lp
- The following two functions can be used to
- print more information about the error.
- .sp
- .(b L
- \fBchar *sm_Error (errorCode)
- int errorCode; /* error code returned by an sm function */\fR
- .)b
- .(b L
- \fBchar *sm_ErrorId (errorCode)
- int errorCode; /* error code returned by an sm function */\fR
- .)b
- .lp
- These are the only Storage Manager functions that do not return an integer.
- When a client library function returns an error,
- sm_Error(\ ) can be called by the application
- to get a string that provides a brief description of the error.
- It also provides descriptions of Unix error codes.
- The function sm_ErrorId(\ ) returns the string representation of
- the error code.
- For example, the call sm_ErrorId(esmBADOID) returns the string \*(lqesmBADOID\*(rq,
- and the call sm_Error(esmBADOID) returns the string \*(lqinvalid object id.\*(rq
- .lp
- If the client is disconnected from a server (by a server crash,
- network failure, etc.) the client library tries to reconnect
- to the server the next time it issues a request to the server.
- If the server in question is not available,
- the Storage Manager returns an error such as esmSERVERDIED or
- a Unix error such as ECONNREFUSED.
- While the server in question is doing recovery after a restart,
- esmTRANSDISABLED is returned.
- The server responds to requests when recovery is completed.
- .br
- .sh 2 "Initialization and Shutdown Operations"
- .lp
- Initialization and shutdown functions are used at the
- beginning and end of an application program,
- but most of them can be called at any time.
- The pertinent functions are
- sm_SetClientOption(\ ),
- sm_GetClientOption(\ ),
- sm_ParseCommandLine(\ ),
- sm_ReadConfigFile(\ ),
- sm_Initialize(\ ),
- and
- sm_ShutDown(\ ).
- .lp
- Before initializing the Storage Manager client with sm_Initialize(\ ),
- a number of client configuration options must be set by the application.
- .(x z
- configuration options
- .)x \*($n
- Options can be set through calls to sm_SetClientOption(\ ),
- sm_ParseCommandLine(\ ), or sm_ReadConfigFile(\ ).
- These options are summarized in Table 1.
- See
- Section 3 for information that applies to all options.
- .(b
- .TS
- box, center, tab(;);
- c|c|c|c|c
- c|c|c|c|c
- l|l|l|l|l.
- Option;Option;Possible;Default;Option
- Name;Type;Values;Values;Description
- _
- bufpages;int;> 4;none;# pages in the buffer pool
- groups;int;> 3;20;# buffer groups
- userdesc;int;> 0;2000;# user descriptors
- mount;string;volid port@host;none;where to find server
- ;;;;for this volume
- lognewpages;Boolean;yes,no,true,false;no/false;client logs new pages
- deallocpages;Boolean;yes,no,true,false;yes/true;removes empty pages
- pagelock;string;SH,EX;SH;default lock for pages
- traceflags;int;>= 0;0;set tracing flags
- locktimeout;int;>= 0;30;# 10-second intervals
- ;;;;willing to await a lock
- .TE
- .ce
- .uh "Table 1: Client Options"
- .(x z
- options, client
- .)x \*($n
- .)b
- .lp
- The \*(lqbufpages\*(rq option
- sets the size of the client buffer pool in 8 Kbyte pages
- (or \fIn\fR byte pages, for \fIn\fR=MIN_PAGESIZE;
- MIN_PAGESIZE is defined in \fCsm_client.h\fR).
- See
- Section 4.11.3, \fBTuning the Application\fR
- for more information about setting this option.
- .lp
- The \*(lqgroups\*(rq option sets the limit on the number of buffer groups that can
- be opened at once.
- The default value is 20.
- See
- Section 4.6, \fBBuffer Operations\fR,
- for more information about buffer groups.
- .lp
- The \*(lquserdescs\*(rq option sets the limit on the number of open user descriptors.
- The number of user descriptors should be set to the
- maximum number of simultaneous object
- references that are expected by the application program.
- The default value is 2000.
- See
- Section 4.7, \fBOperations on Objects\fR,
- for more information about user descriptors.
- .lp
- The \*(lqlognewpages\*(rq option, if \*(lqyes\*(rq, causes the client to generate
- log pages for newly allocated pages, and if \*(lqno\*(rq, causes the
- server to generate the log pages.
- Setting this option to \*(lqno\*(rq results in fewer
- log records shipped to servers and usually
- lowers log space requirements for transactions that create objects.
- With rare patterns of use, setting
- \*(lqlognewpages\*(rq to \*(lqyes\*(rq results in better performance:
- if the objects that cause new pages to be allocated are small,
- and if enough work is done between object-creation operations to cause
- the newly allocated pages to be swapped,
- the preferred value for \*(lqlognewpages\*(rq is \*(lqyes\*(rq.
- In general, it is difficult to predict which objects will
- be created on newly allocated pages.
- The \*(lqlognewpages\*(rq option may be set only when a transaction is not active.
- .lp
- The \*(lqdeallocpages\*(rq option, if \*(lqyes\*(rq, causes the client to deallocate
- pages that become empty after objects are destroyed.
- If the option's value is \*(lqno\*(rq, these pages remain in the file,
- and do not get used again unless an appropriate \fInear-hint\fR
- .(x z
- near-hint
- .)x \*($n
- .(x z
- hint, near-
- .)x \*($n
- is given when an object is subsequently created.
- Under most circumstances, the preferred value of \*(lqdeallocpages\*(rq
- is \*(lqyes\*(rq.
- If objects are created and destroyed in a LIFO fashion, and
- if the near-hint for object creation is NEAR_LAST,
- the preferred value is \*(lqno\*(rq.
- .lp
- The \*(lqpagelock\*(rq option changes the default
- lock mode for pages.
- See
- Section 4.2, \fBInitialization and Shutdown Operations\fR,
- and Appendix A, \fBLocking Protocol for Storage Manager Operations\fR
- for information about using options.
- .lp
- The \*(lqtraceflags\*(rq option is used to turn on tracing, and is only
- available in a Storage Manager that was compiled with -DDEBUG.
- The \*(lqtraceflags\*(rq option takes effect immediately and can be set at any time.
- .lp
- The \*(lqmount\*(rq options indicate the locations of the volumes that the
- applications use.
- The \*(lqmount\*(rq option may be used more than once,
- to \fIadd\fR new volumes to the
- client library's set of usable volumes,
- or to \fIchange\fR the location of a volume.
- The option value consists of a volume's integer identifier,
- followed by the port and Internet address of
- a server that manages the volume.
- The port and Internet address have the format
- \fIport @ host\fR,
- where both the port and the host can be numeric or symbolic.
- Symbolic port names must be
- found in the services database used by \fIgetservbyname(3n)\fR,
- and symbolic host names must be in the
- host name database used by \fIgethostbyname(3n)\fR.
- The following example shows four values for the
- \*(lqmount\*(rq option that
- accomplish the same thing in four ways.
- The volume 1000 is managed by the
- server listening on port 1152
- (which
- is called \*(lqbounty\*(rq in the \fC/etc/services\fR
- database) on the local machine,
- whose Internet address is 128.105.2.153, also known as
- \*(lqpitcairn.isle.edu\*(rq to the host-name server.
- .(l
- \fC1000 1152@128.105.2.153\fR
- \fC1000 bounty@pitcairn.isle.edu\fR
- \fC1000 1152@pitcairn.isle.edu\fR
- and
- \fC1000 bounty@128.105.2.153\fR
- .)l
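- .lp
- The value format described above can be taken apart with a few lines of C.
- The following is a minimal sketch, not the client library's own parser:
- the names \fCmount_spec\fR and \fCparse_mount_spec\fR are hypothetical,
- and the sketch only splits the value into its three parts.
- It does not resolve symbolic ports or host names.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical holder for the three parts of a "mount" option value. */
struct mount_spec {
    long volid;      /* integer volume identifier */
    char port[64];   /* numeric or symbolic port */
    char host[256];  /* numeric or symbolic host */
};

/*
 * Split a value of the form "volid port@host" (spaces around '@' are
 * tolerated) into its parts.
 * Returns 0 on success, -1 if the value does not have that shape.
 */
int parse_mount_spec(const char *value, struct mount_spec *out)
{
    char extra;

    assert(value != NULL && out != NULL);
    /*
     * %63[^@ ] reads the port up to '@' or a space, %255s reads the host.
     * The trailing %c must not match: exactly three conversions succeed
     * when nothing but white space follows the host.
     */
    if (sscanf(value, "%ld %63[^@ ] @%255s %c",
               &out->volid, out->port, out->host, &extra) != 3)
        return -1;
    return 0;
}
```

- Under these assumptions, each of the four example values above yields
- the volume identifier 1000 together with the port and host exactly as written.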
- .lp
- The host name \fIlocalhost\fR \fBdoes not work\fR
- if you are using distributed transactions (multiple
- cooperating servers).
- .lp
- Volume identifiers \fBmust identify volumes
- unambiguously, across all servers.\fR
- .lp
- For each application or client,
- \fBall the host names used for a given server
- must resolve to the same Internet address\fR.
- Using the above example,
- this means that
- \*(lq128.105.2.153\*(rq
- and
- \*(lqpitcairn.isle.edu\*(rq
- are interchangeable.
- \*(lqLocalhost\*(rq,
- which resolves to the Internet address 127.0.0.1,
- is not interchangeable with
- \*(lq128.105.2.153\*(rq
- or
- \*(lqpitcairn.isle.edu\*(rq,
- even though it addresses the same machine when
- used by a client on
- \*(lqpitcairn.isle.edu\*(rq.
- .lp
- It is acceptable to use two \fIdifferent\fR servers running
- on a machine, by addressing them at different \fIports\fR.
- This means that
- .(l
- \fC1000 1151@pitcairn.isle.edu\fR
- and
- \fC2000 1152@pitcairn.isle.edu\fR
- .)l
- can serve an application.
- .lp
- The \*(lqlocktimeout\*(rq option
- limits the time the server waits to acquire a lock on behalf
- of the client.
- The value represents a number of 10-second intervals.
- A value of zero means that the server does not wait at all,
- and if the lock cannot be acquired immediately, the
- client operation returns esmFAILURE, with
- esmLOCKBUSY in sm_errno.
- The option value can be changed at any time.
- The value that is in effect at the time a transaction
- makes its first request to a server
- is the value used for lock requests on that server
- for the duration of the transaction.
- See Appendix A,
- Section A.3, \fBDeadlock Detection and Avoidance\fR,
- for more information about locks.
- See also
- Section 4.4, \fBMounting and Dismounting Volumes\fR,
- for information concerning the protocol between clients and
- servers.
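- .lp
- Because the \*(lqlocktimeout\*(rq value counts 10-second intervals rather than seconds,
- an application that thinks in seconds has to convert.
- The helper below is a small sketch of that arithmetic;
- the name \fClocktimeout_for_seconds\fR is hypothetical,
- and rounding up (rather than down) is a choice made here, not something the
- Storage Manager prescribes.

```c
#include <assert.h>

/*
 * Hypothetical helper: the smallest "locktimeout" value that covers
 * the requested number of seconds.
 */
int locktimeout_for_seconds(int seconds)
{
    assert(seconds >= 0);
    return (seconds + 9) / 10;   /* round up to whole 10-second intervals */
}
```

- With this conversion, the default value of 30 corresponds to a wait of 300 seconds.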
- .lp
- To support code that was written before the configuration option
- .(x z
- configuration options
- .)x \*($n
- facility was added, the client library looks for the
- environment variable ESMCONFIG.
- If set, ESMCONFIG indicates a configuration file to read.
- .(x z
- configuration file
- .)x \*($n
- The file is read using sm_ReadConfigFile(\ ),
- with its \*(lqprogramName\*(rq argument having the value NULL.
- It is read before any option is set, so all other functions that set
- options override those found in the ESMCONFIG file.
- .sp
- .(b L
- \fBsm_SetClientOption (optionName, optionValue, valueType)
- char *optionName; /* IN name of the option to set */
- void *optionValue; /* IN new value for the option */
- SMDATATYPE valueType; /* IN type of optionValue */\fR
- .)b
- .(x z
- sm_SetClientOption
- .)x \*($n
- Sm_SetClientOption(\ ) sets the option named \*(lqoptionName\*(rq to the
- value in \*(lqoptionValue\*(rq.
- The \*(lqvalueType\*(rq argument indicates the
- type addressed by \*(lqoptionValue\*(rq.
- The supported types
- are SM_int, SM_Boolean, and SM_string.
- If \*(lqvalueType\*(rq matches the type of the
- option as specified in Table 1, a simple assignment is done.
- If \*(lqvalueType\*(rq is SM_string and the option has a different type,
- a conversion is performed.
- .sp
- .(b L
- \fBsm_GetClientOption (optionName, optionValue)
- char *optionName; /* IN name of the option to get */
- void *optionValue; /* OUT value for the option */\fR
- .)b
- .(x z
- sm_GetClientOption
- .)x \*($n
- Sm_GetClientOption(\ ) retrieves the value for \*(lqoptionName\*(rq
- and returns it in \*(lqoptionValue\*(rq. It is assumed that the location
- addressed by \*(lqoptionValue\*(rq
- matches the type, found in Table 1, for the option.
- For string-type options, the argument \*(lqoptionValue\*(rq is
- treated as type \*(lqconst char **\*(rq. That is, it should contain the
- address of a pointer variable that is updated to point to a
- read-only buffer containing the option value.
- .sp
- .(b L
- \fBsm_ParseCommandLine (argc, argv, errorMsg)
- int *argc; /* IN/OUT number of command line arguments */
- char **argv; /* IN/OUT command line arguments */
- char **errorMsg; /* OUT syntax error message */\fR
- .)b
- .(x z
- sm_ParseCommandLine
- .)x \*($n
- Sm_ParseCommandLine(\ ) searches the command line, \*(lqargv\*(rq, for any
- client options. Command-line options are
- prefixed by a \*(lq-\*(rq. The value for the option must follow the
- option name.
- The Storage Manager ignores
- any command-line argument that is not recognized as a
- Storage Manager client option.
- If a client option is found, the name and value are removed
- from \*(lqargv\*(rq and \*(lqargc\*(rq is decremented by 2, even if there is
- an error in the option such as being given an illegal value.
- If there is an error processing any option, \*(lqerrorMsg\*(rq is
- changed to point to an error message string.
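- .lp
- The effect on \*(lqargc\*(rq and \*(lqargv\*(rq can be pictured with a stand-alone sketch.
- The code below is not the library's implementation:
- \fCstrip_client_options\fR and \fCis_client_option\fR are hypothetical names,
- the option names are those listed in Table 1,
- and option values are skipped rather than validated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Option names as listed in Table 1. */
static const char *const client_options[] = {
    "bufpages", "groups", "userdesc", "mount", "lognewpages",
    "deallocpages", "pagelock", "traceflags", "locktimeout"
};

/* Returns 1 if arg is "-" followed by a known client option name. */
static int is_client_option(const char *arg)
{
    size_t i;

    if (arg[0] != '-')
        return 0;
    for (i = 0; i < sizeof client_options / sizeof client_options[0]; i++)
        if (strcmp(arg + 1, client_options[i]) == 0)
            return 1;
    return 0;
}

/*
 * Remove every recognized "-name value" pair from argv, shifting the
 * remaining arguments down and reducing *argc by 2 per pair.
 * Unrecognized arguments are left in place, in their original order.
 * Returns the number of pairs removed.
 */
int strip_client_options(int *argc, char **argv)
{
    int in, out = 1, removed = 0;

    assert(argc != NULL && argv != NULL && *argc >= 1);
    for (in = 1; in < *argc; in++) {
        if (is_client_option(argv[in]) && in + 1 < *argc) {
            in++;                    /* skip the option's value too */
            removed++;
        } else {
            argv[out++] = argv[in];  /* keep everything else */
        }
    }
    *argc = out;
    argv[out] = NULL;                /* keep the conventional terminator */
    return removed;
}
```

- An application can therefore call sm_ParseCommandLine(\ ) first and then
- parse whatever remains in \*(lqargv\*(rq with its own option handling.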
- .sp
- .(b L
- \fBsm_ReadConfigFile (configFile, programName, errorMsg)
- char *configFile; /* IN name of the configuration file */
- char *programName; /* IN name of the application */
- char **errorMsg; /* OUT syntax error message */\fR
- .)b
- .lp
- Sm_ReadConfigFile(\ ) reads the option configuration file
- .(x z
- sm_ReadConfigFile
- .)x \*($n
- \*(lqconfigFile\*(rq, and sets the options indicated.
- If \*(lqconfigFile\*(rq is NULL, the default configuration
- files \fC/usr/lib/exodus/sm_config\fR, \fC$HOME/.sm_config\fR,
- and \fC./.sm_config \fR are read in that order, if they exist.
- The name of the default configuration file \fC/usr/lib/exodus/sm_config\fR
- .(x z
- default configuration files
- .)x \*($n
- .(x z
- configuration files, default
- .)x \*($n
- can be changed
- with a minor Storage Manager source code change
- described in the installation manual,
- \fIEXODUS Storage Manager Installation Manual\fR.
- The \*(lqprogramName\*(rq argument gives the program name for matching with
- options in the configuration file.
- If \*(lqprogramName\*(rq is NULL and a previous call to sm_ReadConfigFile(\ )
- had a non-NULL \*(lqprogramName\*(rq, the previous \*(lqprogramName\*(rq is used.
- If no previous call was made and a \*(lqprogramName\*(rq is not given,
- configuration file lines that contain a program name are not used;
- only generic entries, such as \fCclient.bufpages: 1000\fR and
- \fCclient*bufpages: 1000\fR are used.
- .lp
- When an error occurs while reading the file, an error message is
- stored in \*(lqerrorMsg\*(rq and esmFAILURE is returned, as with other
- Storage Manager functions.
- The \*(lqerrorMsg\*(rq describes syntax-related errors in the configuration file.
- .lp
- See
- Section 3 for information about the format of configuration files.
- .sp
- .(b L
- \fBsm_Initialize (\ )\fR
- .)b
- .(x z
- sm_Initialize(\ )
- .)x \*($n
- Sm_Initialize(\ ) initializes the Storage Manager's data structures.
- No Storage Manager functions except option and configuration file functions may be called
- .(x z
- configuration file
- .)x \*($n
- before sm_Initialize(\ ) is called.
- Options that do not have defaults
- must be set before sm_Initialize(\ ) is called; otherwise
- esmFAILURE is returned and sm_errno is set to indicate
- the problem.
- .sp
- .(b L
- \fBsm_ShutDown (\ )\fR
- .)b
- Sm_ShutDown(\ )
- .(x z
- sm_ShutDown(\ )
- .)x \*($n
- closes all the open buffer groups and
- frees the memory allocated at run-time
- by the client library.
- Once the client library has been shut down,
- it can be used again by calling sm_Initialize(\ ).
- The client library loses track of the information
- in the \*(lqmount\*(rq client options, so
- if sm_Initialize(\ ) is to be used again, the
- configuration files must be reread or the mount
- options must be reset with sm_SetClientOption(\ ).
- .lp
- Figure 2 shows a simple \*(lqhello world\*(rq application for the Storage
- Manager.
- It sets configuration options, initializes the client library,
- .(x z
- configuration options
- .)x \*($n
- and shuts down the client library.
- A more complete program would also begin transactions and
- perform operations on objects, files, and indexes.
- More sample programs are included with the software release.
- .(z I
- .sz -3
- \fC/*
- * "Hello world" program: demonstrates initialization and shutdown.
- */
- #include <stdio.h>
- #include <stdlib.h>
- #include "sm_client.h"
-
- void ErrorCheck (int, char *);
-
- main(int argc, char** argv) {
- int e;
- char *errorMsg;
-
- e = sm_ReadConfigFile(NULL, argv[0], &errorMsg);
- if (e != esmNOERROR) {
- fprintf(stderr, "Configuration file error: %s", errorMsg);
- ErrorCheck(e, "sm_ReadConfigFile");
- exit(1);
- }
- e = sm_ParseCommandLine(&argc, argv, &errorMsg);
- if (e != esmNOERROR) {
- fprintf(stderr, "Command line error: %s", errorMsg);
- ErrorCheck(e, "sm_ParseCommandLine");
- exit(1);
- }
-
- e = sm_Initialize(\ ); ErrorCheck(e, "sm_Initialize");
- printf("Hello world!");
- e = sm_ShutDown(\ ); ErrorCheck(e, "sm_ShutDown");
- }
-
- void ErrorCheck (int e, char *func) {
- if (e < 0) {
- fprintf(stderr, "Storage Manager error \e"%s\e" in %s",
- sm_Error(sm_errno), func);
- exit(1);
- }
- }\fR
- .sz +3
- .ce
- .uh "Figure 2: Example Program"
- .)z
- .br
- .sh 2 "Transactions"
- .lp
- The Storage Manager supports transactions,
- including concurrency control and recovery.
- Transactions may involve data managed by several Exodus Storage Manager
- servers, in which case a two-phase commit protocol, based on
- .(x z
- presumed abort
- .)x \*($n
- Presumed Abort [Moha83],
- determines the fate of the transaction when the
- application commits the transaction.
- The fact that such a transaction is distributed over several servers
- .(x z
- transactions, distributed
- .)x \*($n
- .(x z
- distributed transactions
- .)x \*($n
- is invisible to the application.
- On the other hand,
- the Storage Manager (server or servers)
- can cooperate in a two-phase commit procedure
- with other transaction processing systems when the
- external two-phase commit functions are used.
- The external two-phase commit functions also can be used explicitly
- to invoke the two phases for a transaction
- that involves only Exodus Storage Manager servers.
- The external two-phase commit functions are described
- under \*(lqAdvanced Topics\*(rq, in
- Section 4.11.3, \fBExternal Two-Phase Commit Functions\fR.
- .lp
- Object, file, index, and
- root entry operations must be performed within the scope of a
- transaction, or an error is returned.
- An application can run no more than one transaction at a time.
- Transactions cannot be nested, suspended, or resumed.
- .lp
- In order to guarantee the semantics of transactions,
- operations on objects and files acquire \fIlocks\fR.
- .(x z
- locks
- .)x \*($n
- Appendix A describes the kinds of locks acquired by
- the client library functions.
- .br
- .sh 3 "Transaction Identifiers"
- .lp
- Each transaction has a local transaction identifier, which is
- assigned by the Storage Manager.
- The data type TID represents a transaction identifier.
- .(x z
- transaction identifier
- .)x \*($n
- .(x z
- transaction identifier, local
- .)x \*($n
- The application can treat a TID as an opaque value.
- The Storage Manager maintains a global variable, Tid, of type TID,
- whose value the application can inspect but must not modify.
- .lp
- The application can use the following two macros
- to give an initial value
- to a transaction identifier,
- and to recognize that value.
- .(b I
- \fBINVALIDATE_TID (TID tid)\fR
- .)b
- .lp
- sets the \*(lqtid\*(rq argument to an invalid transaction identifier.
- .(b I
- \fBTID_IS_INVALID (TID tid)\fR
- .)b
- .lp
- returns TRUE if \*(lqtid\*(rq is the value given by
- INVALIDATE_TID(\ ), FALSE if not.
- TID_IS_INVALID(\ ) does not tell if there is an active transaction
- with the given transaction identifier.
- .br
- .sh 3 "Transaction States"
- .lp
- An application is always in one of the following states:
- not running a transaction (INACTIVE),
- running a transaction (ACTIVE), or
- running a transaction that has been (partially) aborted (ABORTED).
- .lp
- An application is in the INACTIVE state until it calls
- sm_BeginTransaction(\ ), and
- after a call to sm_CommitTransaction(\ ) or sm_AbortTransaction(\ ).
- .lp
- If the Storage Manager server or client library aborts a transaction,
- which sometimes happens because of an error on the part of the application,
- the application is in the ABORTED state until a call to sm_AbortTransaction(\ ).
- While in the ABORTED state, a call to any function other than sm_AbortTransaction(\ ) returns the
- .(x z
- esmTRANSABORTED
- .)x \*($n
- error esmTRANSABORTED.
- .br
- .sh 3 "Transaction Operations"
- .sp
- .(b L
- \fBsm_BeginTransaction (tid)
- TID *tid; /* OUT transaction ID */\fR
- .)b
- .(x z
- sm_BeginTransaction(\ )
- .)x \*($n
- Sm_BeginTransaction(\ ) is called at the beginning of a transaction.
- The argument \*(lqtid\*(rq corresponds to a transaction identifier and is
- assigned by the Storage Manager.
- .lp
- Sm_BeginTransaction(\ ) \fBdoes not\fR
- contact any servers or
- initiate a transaction with any server,
- since the operation has no arguments to indicate
- which servers are of interest.
- It only begins a transaction
- \*(lqlocally\*(rq.
- Once a transaction has begun locally,
- the client library initiates transactions
- on servers when data references so require.
- .sp
- .(b L
- \fBsm_CommitTransaction (tid)
- TID tid; /* IN transaction ID */\fR
- .)b
- .(x z
- sm_CommitTransaction(\ )
- .)x \*($n
- Sm_CommitTransaction(\ ) is called to commit the effects of a
- transaction.
- If the commit
- succeeds, all changes made to data since the beginning of the
- transaction are guaranteed to be persistent,
- even in the event of system failure.
- See
- Section 4.9.1, \fBConsistency Guarantees for Files\fR,
- for more information about this guarantee.
- If the commit fails, an error is returned, and the
- transaction is aborted.
- When a transaction is committed, all user descriptors (see sm_ReadObject(\ ) )
- are released.
- Buffer groups attached to the transaction
- (see sm_OpenBufferGroup(\ ) ) are closed.
- .sp
- .(b L
- \fBsm_AbortTransaction (tid)
- TID tid; /* IN transaction ID */\fR
- .)b
- .(x z
- sm_AbortTransaction(\ )
- .)x \*($n
- Sm_AbortTransaction(\ ) aborts a transaction.
- Sm_AbortTransaction(\ ) releases
- all the user descriptors that were created during
- the transaction (see sm_ReadObject(\ ) ).
- Buffer groups attached to the
- transaction (see sm_OpenBufferGroup(\ ) ) are closed.
- .lp
- The persistent data appear as if the transaction never began.
- The execution state of the application program is not affected
- by calling sm_AbortTransaction(\ ).
- The result is that the transient data in the program's address
- space do not match the state of the persistent data.
- The problem can be alleviated to some degree by judicious
- use of \fIsetjmp(2)\fR, \fIlongjmp(2)\fR, and lexical scoping
- in the application program.
- The following macros, which are defined in \fCsm_client.h\fR,
- do that:
- .sp
- .(b L
- \fBSM_BEGIN_TRANSACTION (tid, abortCode)
- TID *tid; /* transaction ID */
- int abortCode; /* location to store abort code */\fR
- .)b
- SM_BEGIN_TRANSACTION begins a transaction block (i.e. it opens a
- new lexical scope in C or C++). The transaction
- ID is placed in \*(lqtid\*(rq. The argument \*(lqabortCode\*(rq \fBmust\fR be a
- variable. This variable can be checked at the end of the transaction
- to determine whether it was aborted.
- .sp
- .(b L
- \fBSM_COMMIT_TRANSACTION (tid)
- TID tid; /* transaction ID */\fR
- .)b
- SM_COMMIT_TRANSACTION ends a transaction block.
- When this statement is executed, the transaction is committed,
- assuming no error occurs during commit.
- Immediately after the SM_COMMIT_TRANSACTION statement, the \*(lqabortCode\*(rq variable given
- in the SM_BEGIN_TRANSACTION statement should be checked to see
- if any error occurred.
- If no error occurred, \*(lqabortCode\*(rq is set to esmNOERROR.
- Otherwise, \*(lqabortCode\*(rq is set to the value given in SM_ABORT_TRANSACTION.
- .sp
- .(b L
- \fBSM_ABORT_TRANSACTION (abortCode)
- int abortCode; /* error to return on abort */\fR
- .)b
- SM_ABORT_TRANSACTION aborts the active transaction
- (i.e. sm_AbortTransaction(\ ) is called)
- and resumes execution at the line
- immediately following the SM_COMMIT_TRANSACTION statement for the transaction.
- The SM_ABORT_TRANSACTION macro does not need to be called within the
- lexical scope of the transaction block.
- It can be called in any function operating in
- the dynamic scope of the transaction.
- The \*(lqabortCode\*(rq argument sets the \*(lqabortCode\*(rq variable
- given in SM_BEGIN_TRANSACTION.
- .lp
- When SM_ABORT_TRANSACTION is called, the program's control is
- transferred to the program point after the SM_COMMIT_TRANSACTION statement.
- The stack pointer is restored to the level of the transaction block,
- so functions on the program's stack after it are not completed.
- \fBFor C++, this means that destructors are not called for any local variables
- in those functions.\fR
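- .lp
- The control transfer described above is the ordinary behavior of
- \fIsetjmp\fR and \fIlongjmp\fR.
- The following stand-alone sketch shows the mechanism only.
- It is not the definition of the macros in \fCsm_client.h\fR,
- it calls no Storage Manager functions,
- and every name in it (\fCtxn_env\fR, \fCabort_txn\fR, \fCrun_txn\fR, and so on) is hypothetical.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf txn_env;        /* saved at entry to the transaction block */
static int txn_active = 0;     /* nonzero while inside the block */
static int txn_abort_code = 0; /* plays the role of the abortCode variable */

/* May be called at any depth; unwinds to the transaction block. */
static void abort_txn(int abort_code)
{
    assert(txn_active);
    txn_active = 0;
    txn_abort_code = abort_code;
    longjmp(txn_env, 1);       /* frames above the block are discarded */
}

/* Stands in for application code running inside the transaction. */
static void nested_work(int fail_code)
{
    if (fail_code != 0)
        abort_txn(fail_code);
}

/* Returns 0 if the block ran to its end, else the abort code. */
int run_txn(int fail_code)
{
    txn_abort_code = 0;
    if (setjmp(txn_env) == 0) {
        /* first pass: the body of the transaction block */
        txn_active = 1;
        nested_work(fail_code);
        txn_active = 0;        /* reached only when nothing aborted */
    }
    /* execution resumes here after either outcome */
    return txn_abort_code;
}
```

- As in the macros, the code after the block learns the outcome only by
- inspecting the stored abort code, and nothing is run on behalf of the discarded stack frames.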
- .lp
- Examples of using both the transaction macros and functions can
- be found in the producer-consumer example given in the
- Storage Manager software release.
- .br
- .sh 2 "Mounting and Dismounting Volumes"
- .lp
- An application program \fBdoes not\fR need to mount and
- dismount volumes explicitly.
- In most cases,
- the client library automatically mounts a volume
- when the application makes its first reference to that volume.
- An application that does not explicitly mount a volume
- may, when it
- performs its first operation on an object,
- find that the server for that object is not
- running.
- Writing programs to handle such
- common errors can be difficult,
- so it may be more convenient to
- mount volumes before proceeding with
- operations on data.
- Sm_MountVolume(\ ) serves that purpose.
- If that server
- has not yet been contacted, sm_MountVolume(\ )
- establishes a connection to the server and mounts
- the volume.
- It does not begin a transaction.
- (See Section 4.3.3, \fBTransaction Operations\fR to
- understand how transactions are begun.)
- .lp
- When an application exits or calls sm_ShutDown(\ ),
- connections to servers are severed, and the servers
- dismount the volumes used by the application.
- A server severs its connections and dismounts the
- volumes if an application is \fIinactive\fR for
- a significant time.
- An application is inactive if it has no transaction running.
- .lp
- An application can dismount volumes explicitly,
- causing the volumes to be dismounted at the server.
- An application that continues to run after it is
- finished using the Storage Manager would do well
- to use sm_ShutDown(\ ).
- If it is inappropriate to use sm_ShutDown(\ ),
- but such an application is finished with
- a set of volumes, it would do best to
- dismount the volumes,
- particularly if the volumes are likely
- to be reformatted.
- .sp
- .(b L
- \fBsm_MountVolume ( volid )
- VOLID volid; /* IN volume to mount */\fR
- .)b
- .(x z
- sm_MountVolume(\ )
- .)x \*($n
- .lp
- Sm_MountVolume(\ ) causes the volume identified
- by \*(lqvolid\*(rq to be mounted.
- A side effect of the operation is that the
- client library has established a connection with
- the server that manages this volume.
- .lp
- If the volume cannot be mounted, sm_MountVolume(\ ) returns
- esmFAILURE and a value in sm_errno that describes the reason:
- esmNOSUCHVOLUME (the client library cannot identify
- the server for this volume because there is no \*(lqmount\*(rq
- option for this volid),
- esmTRANSABORTED (the transaction was aborted during the previous
- operation, and the next thing the application must do is
- abort the transaction),
- esmSERVERDIED (connection with server was severed during the
- mount operation),
- or any Unix error code from \fC<errno.h>\fR (such
- as ENETDOWN and ECONNREFUSED), which indicate that
- the server is not running or is unreachable through the network.
- .sp
- .(b L
- \fBsm_DismountVolume ( volid )
- VOLID volid; /* IN volume to dismount */\fR
- .)b
- .(x z
- sm_DismountVolume(\ )
- .)x \*($n
- .lp
- The \*(lqvolid\*(rq argument identifies the volume
- to be dismounted.
- If the volume is not mounted, the operation
- returns esmFAILURE, and
- the client library sets sm_errno to esmBADVOLID.
- .sh 2 "Root Entries"
- .lp
- The root entry facility is designed for applications to get a handle
- to data on a volume. \**
- .(f
- \** Root entries cannot be created on temporary volumes.
- .)f
- A common use of a root entry is to associate a string name with
- an object identifier for an object containing information about
- the contents of the volume.
- For example, in a database system, this might be the
- object identifier for the catalog.
- .lp
- A root entry is a string and data pair stored in a special location on a
- volume, called the root area.
- The string, called the name, is used to identify the entry.
- The name string must be null-terminated.
- The maximum lengths of the name (including the terminating null)
- and data are defined by
- MAX_ROOTNAME_SIZE and MAX_ROOTDATA_SIZE respectively.
- An error is returned if the available number of root entries is exceeded.
- Names and data are limited to 32 bytes each,
- and approximately 90 root entries can reside in a volume's root area.
- .sp
- .(b L
- \fBsm_SetRootEntry (volid, name, data, dataLength)
- VOLID volid; /* IN volume identifier */
- char *name; /* IN name to store data entry under */
- void *data; /* IN data entry to be stored */
- int dataLength; /* IN length of the data */\fR
- .)b
- .(x z
- sm_SetRootEntry(\ )
- .)x \*($n
- .lp
- Sm_SetRootEntry(\ ) creates or updates an entry.
- The \*(lqname\*(rq argument is the name of the entry and the \*(lqdata\*(rq
- argument is the data to be stored.
- The number of bytes in the data is given in \*(lqdataLength\*(rq.
- For example, to store
- the contents of the variable \*(lqrootOid\*(rq under the name \*(lqroot-obj\*(rq, use
- \fCsm_SetRootEntry(volid,
- \*(lqroot-obj\*(rq, (char*) &rootOid, sizeof(rootOid))\fR.
- .lp
- Sm_SetRootEntry(\ ) obtains an exclusive
- .(x z
- lock, exclusive
- .)x \*($n
- lock on the root area of the volume, so
- updates to root entries should be performed in a short transaction.
- .sp
- .(b L
- \fBsm_GetRootEntry (volid, name, data, dataLength)
- VOLID volid; /* IN volume identifier */
- char *name; /* IN name of the entry */
- void *data; /* OUT data stored under name */
- int *dataLength; /* IN/OUT length of the data */\fR
- .)b
- .(x z
- sm_GetRootEntry(\ )
- .)x \*($n
- Sm_GetRootEntry(\ ) retrieves the root entry named \*(lqname\*(rq. The data is
- placed in \*(lqdata\*(rq and the length of the data is returned in
- \*(lqdataLength\*(rq. If \*(lqdataLength\*(rq is initialized with a value greater than
- or equal to zero, the maximum number of bytes copied to \*(lqdata\*(rq
- is \*(lqdataLength\*(rq. If \*(lqdataLength\*(rq is initialized with a value less
- than zero, the entire length of the data is copied to \*(lqdata\*(rq.
- .lp
- Sm_GetRootEntry(\ ) obtains a share lock on the root area of the volume.
- This share lock blocks other
- .(x z
- lock, share
- .)x \*($n
- transactions from updating or removing root entries
- until the transaction is committed or aborted.
- If no root entry exists for \*(lqname\*(rq, esmFAILURE
- is returned and sm_errno is set to esmBADROOTNAME.
- .sp
- .(b L
- \fBsm_RemoveRootEntry (volid, name)
- VOLID volid; /* IN volume identifier */
- char *name; /* IN name of entry */\fR
- .)b
- .(x z
- sm_RemoveRootEntry(\ )
- .)x \*($n
- .lp
- Sm_RemoveRootEntry(\ ) removes the root entry stored under \*(lqname\*(rq.
- Sm_RemoveRootEntry(\ ) obtains an exclusive
- .(x z
- lock, exclusive
- .)x \*($n
- lock on the root area of the volume, so
- removal of root entries should be performed in a short transaction.
- .br
- .sh 2 "Buffer Operations"
- .lp
- The Storage Manager buffer manager implements the concept of a
- \fIbuffer group\fR, as proposed in the DBMIN buffer management
- .(x z
- buffer group
- .)x \*($n
- algorithm [Chou85].
- The essence of the DBMIN algorithm is that
- competing uses of the buffer pool may be
- allocated their own buffers, to minimize competition for
- the buffers and to eliminate thrashing in the buffer pool.
- .lp
- All uses of the buffer pool are made through a buffer group.
- A buffer group is a container of page buffers, with a limit
- on the number of \fIfixed\fR pages it can contain.
- .(x z
- pages, fixed
- .)x \*($n
- .(x z
- fixed pages
- .)x \*($n
- Fixed pages are guaranteed to remain in the buffer pool
- until they are \fIunfixed\fR.
- .(x z
- unfixed pages
- .)x \*($n
- .(x z
- pages, unfixed
- .)x \*($n
- Their locations (virtual addresses) may change, but
- the pages remain in the virtual address space
- of the buffer pool.
- Each buffer group has a replacement policy, which
- controls the replacement of unfixed pages within the buffer group.
- .lp
- Buffer groups can be opened and closed at any time,
- whether or not a transaction is running.
- If a buffer group is opened in a transaction, it may be
- \*(lqattached\*(rq to the transaction, which means that the
- buffer group is closed by the client library
- when the transaction ends.
- An attached buffer group can be closed explicitly
- by the application before the transaction ends.
- .lp
- The following two macros can be used with buffer
- groups
- to give an initial value
- to a buffer group index
- and to recognize that value.
- .(b I
- \fBINVALIDATE_BUFGROUP (int bufgroup)\fR
- .)b
- .lp
- sets the \*(lqbufgroup\*(rq argument to an invalid buffer group index.
- .(b I
- \fBBUFGROUP_IS_INVALID (int bufgroup)\fR
- .)b
- .lp
- returns TRUE if \*(lqbufgroup\*(rq is the value given by
- INVALIDATE_BUFGROUP(\ ), FALSE if it is not.
- BUFGROUP_IS_INVALID(\ ) does not tell if there exists a
- buffer group with the given index.
- .sp
- .(b L
- \fBsm_OpenBufferGroup (groupSize, policy, groupIndex, flags)
- int groupSize; /* IN the maximum group size in pages */
- int policy; /* IN the group's replacement policy */
- int *groupIndex; /* OUT the group's index */
- FLAGS flags; /* IN buffer group attributes */\fR
- .)b
- .(x z
- sm_OpenBufferGroup(\ )
- .)x \*($n
- .lp
- Sm_OpenBufferGroup(\ ) opens a new buffer group.
- The \*(lqgroupSize\*(rq argument specifies the size of the buffer group
- in MIN_PAGESIZE pages.
- The sum of the sizes of all
- open buffer groups cannot exceed the size of the buffer pool.
- (See
- Section 4.11.3, \fBTuning the Application\fR.)
- The choice for \*(lqpolicy\*(rq is
- least-recently-used (BF_LRU)
- or
- most-recently-used (BF_MRU).
- BF_LRU and BF_MRU are defined in \fCsm_client.h\fR.
- The argument \*(lqgroupIndex\*(rq is filled by the Storage Manager and must be
- used in subsequent references to the buffer group.
- (All operations on files and objects require a buffer group index.)
- .lp
- The \*(lqflags\*(rq argument indicates whether the buffer group is to be associated with
- a transaction.
- NOFLAGS indicates that it is not.
- TRANS_GROUP indicates that the buffer group is associated with the
- current transaction.
- The group is closed by the client library
- when the active transaction ends.
- If TRANS_GROUP is used, a transaction must be running
- at the time sm_OpenBufferGroup(\ ) is called.
- .lp
- The effect of sm_OpenBufferGroup(\ ) is to reserve \*(lqgroupSize\*(rq
- pages in the client's buffer pool. No buffer group is opened on the
- server.
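- .lp
- The reservation rule described above (the sizes of all open groups may not
- add up to more than the buffer pool) can be pictured with a small toy model.
- The sketch below is not Storage Manager code: ToyPool, toy_init(\ ),
- toy_open_group(\ ), and toy_close_group(\ ) are hypothetical names, and the
- real client library may account for pages differently.

```c
#include <assert.h>

#define TOY_MAX_GROUPS 8

/* Toy model of a client buffer pool; not the Storage Manager's own structure. */
typedef struct {
    int poolPages;                  /* total pages in the pool */
    int reservedPages;              /* pages reserved by open groups */
    int groupSize[TOY_MAX_GROUPS];  /* 0 means the slot is free */
} ToyPool;

void toy_init(ToyPool *p, int poolPages)
{
    int i;
    p->poolPages = poolPages;
    p->reservedPages = 0;
    for (i = 0; i < TOY_MAX_GROUPS; i++)
        p->groupSize[i] = 0;
}

/* Reserve groupSize pages; return a group index, or -1 if the request does not fit. */
int toy_open_group(ToyPool *p, int groupSize)
{
    int i;
    assert(groupSize > 0);
    if (p->reservedPages + groupSize > p->poolPages)
        return -1;
    for (i = 0; i < TOY_MAX_GROUPS; i++) {
        if (p->groupSize[i] == 0) {
            p->groupSize[i] = groupSize;
            p->reservedPages += groupSize;
            return i;
        }
    }
    return -1; /* no free group slot */
}

/* Give the group's reservation back to the pool. */
void toy_close_group(ToyPool *p, int groupIndex)
{
    assert(groupIndex >= 0 && groupIndex < TOY_MAX_GROUPS);
    p->reservedPages -= p->groupSize[groupIndex];
    p->groupSize[groupIndex] = 0;
}
```

- .lp
- In this model, a request that fails because the pool is fully reserved can
- succeed after another group is closed, which is the behavior an application
- should plan for when it sizes its buffer groups.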
- .sp
- .(b L
- \fBsm_BufferGroupInfo (groupIndex, maxPages, fixedPages, unfixedPages)
- int groupIndex; /* IN the group to inspect */
- int *maxPages; /* OUT max fixed pages allowed */
- int *fixedPages; /* OUT current # of pages fixed */
- int *unfixedPages; /* OUT current # of pages unfixed */\fR
- .)b
- .(x z
- sm_BufferGroupInfo(\ )
- .)x \*($n
- .lp
- Sm_BufferGroupInfo(\ ) returns information about the open buffer group
- identified by \*(lqgroupIndex\*(rq.
- The function returns the buffer group's size limit in pages
- in \*(lqmaxPages\*(rq.
- In \*(lqfixedPages\*(rq, it
- returns the number of pages currently fixed in the buffer group.
- See the next section for more information about these functions.
- The argument \*(lqunfixedPages\*(rq refers to all buffer pages that
- belong to the buffer group, but are not fixed; that is, these pages
- may be removed from the buffer pool if space is needed for fixed
- pages.
- .sp
- .(b L
- \fBsm_CloseBufferGroup (groupIndex)
- int groupIndex; /* IN the group being closed */\fR
- .)b
- .(x z
- sm_CloseBufferGroup(\ )
- .)x \*($n
- .lp
- Sm_CloseBufferGroup(\ ) closes the open buffer group
- identified by \*(lqgroupIndex\*(rq.
- \" **********************************************************************
- .br
- .sh 2 "Operations on Objects"
- .lp
- An object in the Storage Manager is a container of bytes.
- It can be empty.
- It can have as many as 2\*[31\*] bytes, if the volume on which
- it resides is large enough.
- An object must fit on a single volume (storage device or partition).
- When an object is created,
- the Storage Manager gives the object a unique object identifier.
- An object identifier is described by
- a structure of the type OID, defined as follows:
- .(b I
- \fBtypedef struct {
- SHORTPID pid; /* 32-bit page address of the object's header */
- SLOTINDEX slot; /* 16-bit slot number of the object on the page */
- VOLID volid; /* 16-bit identifier of the volume */
- UNIQUE unique; /* 32-bit number generated at creation time */
- } OID; \fR
- .(x z
- OID
- .)x \*($n
- .)b
- .lp
- The first three fields of an OID are the physical address of the object;
- they identify a volume,
- a page within the volume,
- and a \fIslot\fR on the page.
- .(x z
- slot
- .)x \*($n
- An object's identifier never changes.
- The client library sometimes moves objects, such as when
- an object grows beyond the size of a page, at which time the
- object is marked as \fIforwarded\fR, but its OID remains
- .(x z
- forwarded object
- .)x \*($n
- unchanged.
- .lp
- The \*(lqunique\*(rq field of an OID is a special 32-bit value that is generated
- when the object is created and used to detect dangling and corrupted
- OIDs.
- The generation of unique numbers is discussed in Appendix B.
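- .lp
- One way to see how a \*(lqunique\*(rq value can expose a dangling OID is the
- following toy model, in which each slot remembers the unique number of its
- current occupant.
- It is a sketch under stated assumptions: ToyOid and ToyPage are stand-ins,
- the counter used to generate unique numbers is hypothetical (Appendix B
- describes the real scheme), and the Storage Manager's actual validation
- may differ.

```c
#include <assert.h>

/* Stand-in for OID; field widths follow the comments in the text. */
typedef struct {
    unsigned int   pid;
    unsigned short slot;
    unsigned short volid;
    unsigned int   unique;
} ToyOid;

#define TOY_SLOTS 4

/* One toy page: each slot records the unique number of its current occupant. */
typedef struct {
    int          inUse[TOY_SLOTS];
    unsigned int unique[TOY_SLOTS];
    unsigned int nextUnique;   /* hypothetical generator, a simple counter */
} ToyPage;

/* Create an object in the given free slot and return its OID. */
ToyOid toy_create(ToyPage *pg, unsigned short slot)
{
    ToyOid oid;
    assert(slot < TOY_SLOTS && !pg->inUse[slot]);
    pg->inUse[slot] = 1;
    pg->unique[slot] = ++pg->nextUnique;
    oid.pid = 0;
    oid.volid = 0;
    oid.slot = slot;
    oid.unique = pg->unique[slot];
    return oid;
}

/* Destroy the object; its slot may later be reused by another object. */
void toy_destroy(ToyPage *pg, const ToyOid *oid)
{
    pg->inUse[oid->slot] = 0;
}

/* An OID is valid only if its slot holds the object the OID was issued for. */
int toy_oid_is_valid(const ToyPage *pg, const ToyOid *oid)
{
    return oid->slot < TOY_SLOTS
        && pg->inUse[oid->slot]
        && pg->unique[oid->slot] == oid->unique;
}
```

- .lp
- In this model, an OID kept after its object is destroyed stays invalid even
- when a new object is created in the same slot, because the physical address
- matches but the unique number does not.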
- .lp
- Every time an object is accessed by its OID,
- the Storage Manager validates the OID.
- The application can use the following macros to
- give an illegitimate initial value to an OID,
- and to recognize that value:
- .(b I
- \fBINVALIDATE_OID (OID oid)\fR
- .)b
- sets the \*(lqoid\*(rq argument to an invalid object identifier.
- .(b I
- \fBOID_IS_INVALID (OID oid)\fR
- .)b
- returns TRUE if \*(lqoid\*(rq is the value given by INVALIDATE_OID(\ ),
- FALSE if it is not.
- .lp
- Each object has an \fIobject header\fR, which describes the object,
- .(x z
- object header
- .)x \*($n
- and which can be retrieved without retrieving the object's data.
- The structure of an object header is shown below:
- .(b I
- \fBtypedef struct {
- TWO properties; /* a bit vector */
- TWO tag; /* supplied by the application */
- int size; /* size of the object in bytes */
- } OBJHDR;\fR
- .(x z
- OBJHDR
- .)x \*($n
- .(x z
- object header
- .)x \*($n
- .)b
- .lp
- The \*(lqtag\*(rq is a two-byte field that the Storage Manager does not interpret.
- It is for use by the application.
- No restriction is put on the contents of \*(lqtag\*(rq fields.
- As its name implies, the \*(lqsize\*(rq field is the size of the object in bytes.
- The \*(lqproperties\*(rq field is a read-only bit-vector that indicates the
- presence or absence of the following properties of objects:
- .(b I
- .ip " P_LARGEOBJ" 23
- set if the object is a large object.
- .ip " P_MOVED" 23
- set if this object has been forwarded to another page.
- .ip " P_FROZEN" 23
- set if the object is a frozen version.
- .ip " P_VERSIONED" 23
- set if the object is a frozen version or a descendent of a frozen version.
- .)b
- .lp
- Each object resides in a \fIfile\fR on a \fIvolume\fR.
- .(x z
- file
- .)x \*($n
- .(x z
- volume
- .)x \*($n
- When an object is created, the application tells the client
- library in which file to place the object.
- Files and their uses are discussed in the next section;
- details of their use are not pertinent to understanding
- the operations on objects.
- .lp
- Before an operation can be performed on an existing object,
- the object, or at least the affected parts of the object,
- must be brought into the application's address space.
- This is called \fIpinning\fR the object or its parts.
- .(x z
- pin
- .)x \*($n
- When the object is no longer needed, it must be \fIunpinned\fR,
- .(x z
- unpin
- .)x \*($n
- to make room for other objects to be pinned\**.
- .(f
- \** Objects are pinned; pages are fixed. The gist of the two verbs is the same.
- .)f
- When the client library pins an object in order to perform
- an operation on behalf of the application (for example,
- appending bytes to an object), the client library pins the necessary
- parts of the object and unpins them before it returns control
- to the application.
- When the application
- pins part of an object for its own purposes (such as writing over
- bytes in the object), the pinned part is placed in the client's buffer
- pool, and
- the client library creates a \*(lqhandle\*(rq for the
- object.
- The handle is called a \fIuser descriptor\fR.
- The application can refer to an object only through user descriptors.
- The application must unpin the object by \fIreleasing\fR
- the user descriptor when it is done using the object.
- .lp
- A user descriptor is called \fIvalid\fR if and only
- if the byte range it addresses is pinned.
- An application can pin an object or overlapping parts of an object
- any number of times, having any number of valid user descriptors
- for the same data in an object.
- (This is not wise for performance reasons, but it can be done.)
- .lp
- The client library functions that pin ranges of bytes
- return user descriptors to describe the bytes pinned.
- Functions that require that the range of bytes they affect be pinned
- take user descriptors as input arguments.
- The client library functions that do not take user descriptor
- arguments do not ultimately change the quantity of bytes pinned
- or the number of pages fixed in the buffer pool.
- \fBSuch functions may change the ranges of bytes addressed or the
- bytes themselves, but they do not change the quantity of bytes
- addressed.\fR
- (For example, the function sm_InsertInObject(\ )
- may affect valid user descriptors even though it does not take
- any user descriptors as arguments.)
- .(x z
- user descriptor
- .)x \*($n
- .lp
- User descriptors have the following form:
- .(b I
- \fBtypedef struct {
- char *basePtr; /* ptr to start of data */
- int byteCount; /* number of bytes accessible */
- int objectSize; /* total size of object */
- TWO userFlags; /* properties field from object header */
- TWO type; /* for use only by E */
- TWO flags; /* for use only by E */
- TWO tag; /* tag field from the object header */
- OID oid; /* oid of object being referenced */
- } USERDESC;\fR
- .)b
- .(x z
- USERDESC
- .)x \*($n
- .(x z
- user descriptor
- .)x \*($n
- .lp
- The \*(lqbasePtr\*(rq field of a user descriptor points to the start
- of the object's data in the buffer pool, while the \*(lqbyteCount\*(rq
- field indicates the number of bytes accessible to the application
- program through this user descriptor.
- The value \*(lqobjectSize\*(rq is the length of the entire object.
- The \*(lquserFlags\*(rq field holds a copy of the properties field from the
- object's header.
- The \*(lqtype\*(rq and \*(lqflags\*(rq fields are used by the E language's persistent virtual machine.
- Finally, the \*(lqtag\*(rq field contains a copy of the \*(lqtag\*(rq field in the object's header.
- .lp
- An object's data is referenced indirectly via the \*(lqbasePtr\*(rq field.
- \fBReferences by the application
- must always be indirect via \*(lqbasePtr\*(rq\fR.
- The indirection is necessary because there are times when the
- Storage Manager moves an object in the buffer pool, and
- the \*(lqbasePtr\*(rq of each user descriptor that
- references the object is updated to account for the move.
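- .lp
- The reason for the indirection can be illustrated with a toy buffer pool
- of two frames.
- This is only a sketch: ToyDesc is a reduced stand-in for USERDESC, and
- toy_pin(\ ) and toy_move(\ ) are hypothetical helpers that imitate what the
- text says the Storage Manager does when it moves an object.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for USERDESC, reduced to the two fields this sketch needs. */
typedef struct {
    char *basePtr;
    int   byteCount;
} ToyDesc;

/* Toy buffer pool with two frames; pinned data may move between them. */
typedef struct {
    char frame[2][16];
} ToyBuffers;

/* Pin: copy the data into frame 0 and point the descriptor at it. */
void toy_pin(ToyBuffers *b, ToyDesc *d, const char *data, int n)
{
    assert(n >= 0 && n <= 16);
    memcpy(b->frame[0], data, (size_t)n);
    d->basePtr = b->frame[0];
    d->byteCount = n;
}

/* Move the pinned bytes to frame 1 and update the descriptor to match. */
void toy_move(ToyBuffers *b, ToyDesc *d)
{
    memcpy(b->frame[1], d->basePtr, (size_t)d->byteCount);
    memset(b->frame[0], '?', sizeof b->frame[0]);  /* old frame is reused */
    d->basePtr = b->frame[1];
}
```

- .lp
- A pointer copied out of \*(lqbasePtr\*(rq before the move still refers to the old
- frame, whose contents have changed, while a fresh read of \*(lqbasePtr\*(rq
- reaches the data at its new location.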
- .lp
- The remainder of this section describes the Storage Manager
- functions for operating on objects.
- It is divided into sub-sections that describe
- creating and destroying objects,
- pinning and unpinning parts of objects,
- modifying objects,
- and
- using object headers.
- .br
- .sh 3 "Creating and Destroying Objects"
- .sp
- .(b L
- \fBsm_CreateObject (groupIndex, fid, nearHint, nearObj, objHdr, length, data, oid)
- int groupIndex; /* IN buffer group to use */
- FID *fid; /* IN file in which object is to be placed */
- int nearHint; /* IN flag indicating where to create the new object */
- OID *nearObj; /* IN create the new object near this object */
- OBJHDR *objHdr; /* IN the object's header */
- int length; /* IN amount of data */
- void *data; /* IN the initial data for the object */
- OID *oid; /* OUT the new object's OID */\fR
- .)b
- .(x z
- sm_CreateObject(\ )
- .)x \*($n
- .lp
- Sm_CreateObject(\ ) creates an object in the file identified by \*(lqfid\*(rq.
- If \*(lqobjHdr\*(rq is not NULL, the \*(lqtag\*(rq field in the header of the new object
- is initialized with the contents of the \*(lqtag\*(rq field in the header
- structure addressed by \*(lqobjHdr\*(rq.
- When \*(lqdata\*(rq is not NULL, the object is initialized with
- the data addressed by the argument \*(lqdata\*(rq;
- in this case, \*(lqlength\*(rq specifies how much data to copy.
- When \*(lqdata\*(rq is NULL, an object of size \*(lqlength\*(rq is created and
- filled with zeroes.
- .sp
- The argument \*(lqnearHint\*(rq specifies where the new object should be created.
- The following values, defined in \fCsm_client.h\fR, are near hints:
- NEAR_OBJ, NEAR_FIRST, and NEAR_LAST.
- If \*(lqnearHint\*(rq is set to NEAR_OBJ,
- the new object is created near the object designated by \*(lqnearObj\*(rq.
- If \*(lqnearHint\*(rq is set to NEAR_FIRST or NEAR_LAST, \*(lqnearObj\*(rq is
- ignored and the new object is created
- near the first or last object in the file, respectively.
- .lp
- If sm_CreateObject(\ ) is successful, the OID structure pointed to
- by \*(lqoid\*(rq is filled with the OID of the new object.
- Sm_CreateObject(\ ) does not leave the new object pinned.
- .(b L
- \fBsm_DestroyObject (groupIndex, oid)
- int groupIndex; /* IN buffer group in use */
- OID *oid; /* IN the object to destroy */\fR
- .)b
- .(x z
- sm_DestroyObject(\ )
- .)x \*($n
- .lp
- Sm_DestroyObject(\ ) destroys an object.
- If any user descriptors are valid for the object when the object
- is destroyed, they are made invalid, and
- they must be released with
- sm_ReleaseObject(\ ), described below.
- .br
- .sh 3 "Pinning and Unpinning Objects"
- .sp
- .lp
- The following two functions
- change the number of pages fixed in the client buffer pool.
- All the other functions that operate on objects fix
- pages temporarily and unfix the pages before returning.
-
- .(b L
- \fBsm_ReadObject (groupIndex, oid, start, length, userDesc)
- int groupIndex; /* IN buffer group in use */
- OID *oid; /* IN object to read */
- int start; /* IN starting offset of read */
- int length; /* IN amount of data to read */
- USERDESC **userDesc; /* OUT descriptor to access the data */\fR
- .)b
- .(x z
- sm_ReadObject(\ )
- .)x \*($n
- .lp
- Sm_ReadObject(\ ) reads part or all of the object identified by \*(lqoid\*(rq
- into the buffer group identified by \*(lqgroupIndex\*(rq.
- If \*(lqlength\*(rq is READ_ALL, the entire object is read (assuming that
- the size of the entire object is not greater than the amount of
- unpinned space in the buffer group).
- Otherwise, the bytes to be read are specified by \*(lqstart\*(rq and \*(lqlength\*(rq.
- .lp
- Sm_ReadObject(\ ) pins the specified range of bytes in the buffer pool
- and returns a user descriptor to the caller.
- .(x z
- pin, object
- .)x \*($n
- \fBBytes pinned in the buffer pool by sm_ReadObject(\ ) remain
- pinned until they are explicitly released by sm_ReleaseObject(\ ).\fR
- .lp
- While sm_ReadObject(\ ) can be used to get information about
- the object (from the object header) by giving it a length of zero,
- sm_ReadObjectHeader(\ ) is the preferred way to meet
- the same objective.
- Sm_ReadObject(\ ) performs work
- that is unnecessary when only the object header is of interest,
- and it always fixes at least one page in the buffer pool,
- even if the given length is zero.
- .lp
- The user descriptor consumes resources that must be
- freed with sm_ReleaseObject(\ ), even if the object is not pinned
- .(x z
- sm_ReleaseObject(\ )
- .)x \*($n
- (zero is given for \*(lqlength\*(rq).
- .sp
- .(b L
- \fBsm_ReleaseObject (userDesc)
- USERDESC *userDesc; /* IN descriptor returned by ReadObject */\fR
- .)b
- .(x z
- sm_ReleaseObject(\ )
- .)x \*($n
- .lp
- Sm_ReleaseObject(\ )
- unpins a range of bytes of an object that was pinned by sm_ReadObject(\ ),
- and frees the resources associated with the user descriptor.
- If the user descriptor is not valid,
- sm_ReleaseObject(\ ) sets
- sm_errno to esmBADUSERDESC and returns esmFAILURE.
- .br
- .sh 3 "Modifying Objects"
- .lp
- Four functions modify objects:
- .(x z
- object, modifying
- .)x \*($n
- sm_WriteObject(\ ),
- sm_InsertInObject(\ ),
- sm_AppendToObject(\ ),
- and sm_DeleteFromObject(\ ).
- Sm_WriteObject(\ ) cannot be used to change the size of an object,
- .(x z
- sm_WriteObject(\ )
- .)x \*($n
- only to overwrite parts of an object.
- The other three functions can change the size of an object.
- These functions provide substantial flexibility, and their efficiency varies.
- Changing the size of a small object (one that fits on a disk page)
- is relatively inexpensive.
- It is less expensive than reading and writing the object.
- For large objects, performing many small-size changes can be
- expensive in CPU time and buffer space utilization.
- If a large object is pinned several times simultaneously, through
- different user descriptors,
- updates to the object are very expensive.
- If a large number of small-size changes is required,
- we recommend accumulating the changes and performing them in larger chunks.
- .sp
- .(b L
- \fBsm_WriteObject (groupIndex, start, length, data, userDesc, release)
- int groupIndex; /* IN buffer group in use */
- int start; /* IN starting offset of write */
- int length; /* IN amount of data to be written */
- void *data; /* IN pointer to the data */
- USERDESC *userDesc; /* IN descriptor returned by ReadObject */
- BOOL release; /* IN whether to release the object */\fR
- .)b
- .(x z
- sm_WriteObject(\ )
- .)x \*($n
- .lp
- Sm_WriteObject(\ ) overwrites the region of bytes from
- (userDesc->basePtr\ +\ start)
- to
- (userDesc->basePtr\ +\ start\ +\ length\ -\ 1)
- with the data addressed by the \*(lqdata\*(rq argument.
- The given byte range must have been pinned (which means that
- the user descriptor must be valid).
- If \*(lqrelease\*(rq is TRUE, the range of bytes given by \*(lquserDesc\*(rq
- is unpinned when sm_WriteObject(\ ) returns.
- If \*(lqdata\*(rq is NULL, the region is filled with zeroes.
- \fBAll updates to objects must be performed using sm_WriteObject(\ )\fR
- so that the updates can be logged, and the
- transaction semantics can be guaranteed.
- .sp
- .(b L
- \fBsm_InsertInObject (groupIndex, oid, start, length, data)
- int groupIndex; /* IN buffer group in use */
- OID *oid; /* IN object we're inserting into */
- int start; /* IN starting offset of insert */
- int length; /* IN amount of data being inserted */
- void *data; /* IN data to insert */\fR
- .)b
- .(x z
- sm_InsertInObject(\ )
- .)x \*($n
- .lp
- Sm_InsertInObject(\ ) inserts \*(lqlength\*(rq bytes of data into
- an object, beginning at the offset \*(lqstart\*(rq.
- If \*(lqdata\*(rq is NULL, the inserted region is filled with zeroes.
- If there are any valid user descriptors
- (those for which sm_ReleaseObject(\ ) has not been called)
- for the object at the
- time the insertion takes place, they are reestablished if necessary.
- After the insertion, the base pointers of the valid user descriptors
- point to the byte within the object indicated by
- the \*(lqstart\*(rq argument to the sm_ReadObject(\ )
- operation that created the user descriptor.
- For example,
- an object's first five bytes, "ABCDE" are pinned
- by sm_ReadObject(\ ),
- which was called with a \*(lqstart\*(rq offset of zero and a \*(lqlength\*(rq of five.
- Sm_ReadObject(\ ) returns a user descriptor, U, which addresses
- "ABCDE".
- Sm_InsertInObject(\ ) inserts "ZZ" at \*(lqstart\*(rq offset zero.
- The user descriptor U now addresses "ZZABC", which are pinned, while
- the bytes "DE" are no longer pinned.
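- .lp
- The example above can be reproduced with a small simulation in which a
- descriptor keeps its absolute starting offset and byte count while the
- object's contents shift underneath it.
- ToyObject, ToyDesc, toy_insert(\ ), and toy_view(\ ) are hypothetical names
- used only for this sketch; the client library's internal bookkeeping is
- not described in this manual.

```c
#include <assert.h>
#include <string.h>

#define TOY_CAP 32

/* Toy object: a byte array with a current size. */
typedef struct {
    char bytes[TOY_CAP];
    int  size;
} ToyObject;

/* Toy descriptor: keeps the absolute start offset given when it was created. */
typedef struct {
    int start;
    int byteCount;
} ToyDesc;

/* Insert length bytes at offset start, shifting the tail of the object right. */
void toy_insert(ToyObject *o, int start, int length, const char *data)
{
    assert(start >= 0 && start <= o->size);
    assert(length >= 0 && o->size + length <= TOY_CAP);
    memmove(o->bytes + start + length, o->bytes + start,
            (size_t)(o->size - start));
    memcpy(o->bytes + start, data, (size_t)length);
    o->size += length;
}

/* Copy the bytes a descriptor currently addresses into out, NUL-terminated. */
void toy_view(const ToyObject *o, const ToyDesc *d, char *out)
{
    assert(d->start + d->byteCount <= o->size);
    memcpy(out, o->bytes + d->start, (size_t)d->byteCount);
    out[d->byteCount] = '\0';
}
```

- .lp
- With an object holding "ABCDE" and a descriptor of start 0 and byte count 5,
- calling toy_insert(\ ) with "ZZ" at offset 0 leaves the descriptor viewing
- "ZZABC": the same offsets, but different bytes.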
- .sp
- .(b L
- \fBsm_AppendToObject (groupIndex, oid, length, data)
- int groupIndex; /* IN buffer group in use */
- OID *oid; /* IN object we are appending data to */
- int length; /* IN amount of data being appended */
- void *data; /* IN data to append */\fR
- .)b
- .(x z
- sm_AppendToObject(\ )
- .)x \*($n
- .lp
- Sm_AppendToObject(\ ) appends \*(lqlength\*(rq bytes of data to the end of an object.
- Outstanding user descriptors are handled the same way as for
- sm_InsertInObject(\ ).
- If \*(lqdata\*(rq is NULL, the appended region is filled with zeroes.
- .sp
- .(b L
- \fBsm_DeleteFromObject (groupIndex, oid, start, length)
- int groupIndex; /* IN buffer group in use */
- OID *oid; /* IN object we're deleting from */
- int start; /* IN starting offset of delete */
- int length; /* IN amount of data being deleted */\fR
- .)b
- .(x z
- sm_DeleteFromObject(\ )
- .)x \*($n
- .lp
- Sm_DeleteFromObject(\ ) deletes \*(lqlength\*(rq bytes of
- data from an object, beginning with the byte indicated by
- the offset \*(lqstart\*(rq.
- .(x z
- user descriptor
- .)x \*($n
- Sm_DeleteFromObject(\ ) is analogous to sm_InsertInObject(\ ).
- All valid user descriptors affected by the deletion are,
- if possible, reset to point to the new absolute
- offset within the object.
- There are two cases when this is not possible.
- .np
- The object's size is now smaller
- than the starting offset of a user descriptor.
- The \*(lqbasePtr\*(rq field
- in the user descriptor is set to NULL and
- the user descriptor is made invalid.
- The user descriptor must be released by sm_ReleaseObject(\ )
- so that its resources can be reclaimed.
- .np
- The object's size is now smaller than the original
- byte range addressable by a user descriptor.
- The size of the range addressable by the descriptor is
- reduced to reflect the new size of the object.
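- .lp
- The two cases can be summarized in a short sketch.
- It assumes a simplified descriptor (ToyDesc) that records its absolute
- starting offset, and a hypothetical helper, toy_adjust_after_delete(\ ),
- that is applied once the object has shrunk; it reads the first case
- literally (new size strictly smaller than the starting offset) and is not
- the client library's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified descriptor for this sketch only. */
typedef struct {
    char *basePtr;    /* NULL once the descriptor is invalid */
    int   start;      /* absolute offset given when the range was pinned */
    int   byteCount;  /* number of bytes addressable */
} ToyDesc;

/* Adjust a descriptor after the object has shrunk to newSize bytes.
 * objectData points at the object's first byte in the toy buffer pool. */
void toy_adjust_after_delete(ToyDesc *d, char *objectData, int newSize)
{
    assert(newSize >= 0);
    if (newSize < d->start) {
        /* Case 1: the starting offset no longer exists in the object. */
        d->basePtr = NULL;
        d->byteCount = 0;
        return;
    }
    d->basePtr = objectData + d->start;
    if (d->start + d->byteCount > newSize) {
        /* Case 2: the range runs past the new end, so it is shortened. */
        d->byteCount = newSize - d->start;
    }
}
```

- .lp
- A descriptor that falls under the first case still has to be released,
- as stated above; the sketch only marks it invalid.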
- .br
- .sh 3 "Object Headers"
- .(x z
- object header
- .)x \*($n
- .sp
- .(b L
- \fBsm_ReadObjectHeader (groupIndex, oid, objHdr)
- int groupIndex; /* IN buffer group in use */
- OID *oid; /* IN read this object's header */
- OBJHDR *objHdr; /* OUT place to put the header */\fR
- .)b
- .(x z
- sm_ReadObjectHeader
- .)x \*($n
- .lp
- Sm_ReadObjectHeader(\ ) reads an object's header
- into the structure addressed by \*(lqobjHdr\*(rq.
- This function is the preferred one to use to determine
- if an object's identifier is valid.
- If the object's identifier is invalid, sm_ReadObjectHeader(\ )
- returns esmFAILURE and puts esmBADOID in sm_errno.
- .sp
- .(b L
- \fBsm_SetObjectHeader (groupIndex, oid, objHdr)
- int groupIndex; /* IN buffer group in use */
- OID *oid; /* IN set this object's header flags */
- OBJHDR *objHdr; /* IN the new header */\fR
- .)b
- .(x z
- sm_SetObjectHeader
- .)x \*($n
- .lp
- Sm_SetObjectHeader(\ ) modifies an object's header.
- Only the \*(lqtag\*(rq field is modified; the other
- fields are read-only.
- .br
- .sh 2 "Versions of Objects"
- .(x z
- version of object
- .)x \*($n
- .lp
- In order to allow efficient updating of shared data,
- the Storage Manager offers versions of objects.
- Versions come in two kinds:
- \fIworking versions\fR and \fIfrozen versions\fR.
- .(x z
- version, working
- .)x \*($n
- .(x z
- working version
- .)x \*($n
- .(x z
- version, frozen
- .)x \*($n
- .(x z
- frozen version
- .)x \*($n
- A working version of an object is one that can be modified.
- Every object has at least one version, which is the object itself.
- A working version may be frozen, after which it can no longer be modified.
- .lp
- A new working version, called a \fIdescendent\fR,
- can be made of a frozen object.
- The descendent looks like a new object that is a copy of the
- frozen object from which it came.
- The Storage Manager determines when it is necessary and efficient
- to make a copy of the frozen object, and makes the copy at that time.
- .sp
- .(b L
- \fBsm_CreateVersion (groupIndex, nearHint, parentObj, nearObj, oid)
- int groupIndex; /* IN buffer group to use */
- int nearHint; /* IN flag indicating where to create the new version */
- OID *parentObj; /* IN object to create a version of */
- OID *nearObj; /* IN create the new version near this object */
- OID *oid; /* OUT the new version's OID */\fR
- .)b
- .(x z
- sm_CreateVersion
- .)x \*($n
- .lp
- Sm_CreateVersion(\ ) creates a new version of the object \*(lqparentObj\*(rq in
- the file containing \*(lqparentObj\*(rq.
- The arguments \*(lqgroupIndex\*(rq, \*(lqnearHint\*(rq, and \*(lqnearObj\*(rq are
- used as in sm_CreateObject(\ ).
- The object identifier of the new version is returned in \*(lqoid\*(rq.
- \fBThe object identified by \*(lqparentObj\*(rq must be a frozen version\fR.
- The new version is a working version.
- The new version can be destroyed using sm_DestroyObject(\ ).
- When a new version is created, the P_VERSIONED property is set in
- the object header.
- .lp
- Like sm_CreateObject(\ ), sm_CreateVersion(\ ) does not leave anything
- pinned in the buffer pool.
- .sp
- .(b L
- \fBsm_FreezeVersion (groupIndex, oid)
- int groupIndex; /* IN buffer group to use */
- OID *oid; /* IN object to be frozen */\fR
- .)b
- .(x z
- sm_FreezeVersion
- .)x \*($n
- .lp
- Sm_FreezeVersion(\ ) marks an object as frozen, preventing
- subsequent modification of the object, and allowing new working
- versions to be made from this object.
- When an object is frozen, both
- the P_VERSIONED and the P_FROZEN properties are set in the object
- header.
- Once frozen, an object cannot be unfrozen.
- A frozen object can be destroyed.
- .br
- .sh 2 "Operations on Files"
- .(x z
- file, what is a
- .)x \*($n
- .(x z
- file, operations on
- .)x \*($n
- .lp
- A Storage Manager file is a flexible container in which objects are
- placed when they are created.
- No object exists outside a file.
- .lp
- The objects in a file can be \fIscanned\fR, meaning that
- .(x z
- scanning a file
- .)x \*($n
- they are visited exactly once.
- .lp
- Files do not have preallocated space or ownership properties.
- Various consistency guarantees can be associated with files,
- with the effect that updating data in different files has
- different costs.
- .lp
- The Storage Manager offers operations
- for creating, destroying, scanning, and bulk-loading files,
- and for changing the consistency guarantees associated with files.
- Some operations on files acquire locks on entire files.
- The locks acquired are described in Appendix A.
- .lp
- A file is identified by a unique file identifier or FID.
- The Storage Manager does not provide a way to find all files or
- file identifiers that exist, so
- it is left to the application to keep track of its file identifiers.
- .(x z
- FID
- .)x \*($n
- .(x z
- file identifier
- .)x \*($n
- For example, consider an application that
- embeds file identifiers in objects to create a logical hierarchy of files.
- The application had best destroy the files
- in a depth-first fashion, lest it lose a file identifier
- before the file it identifies is destroyed.
- .lp
- The following two macros can be used to give a file identifier
- an illegitimate initial value, and later to recognize that value:
- .(b I
- \fBINVALIDATE_FID (FID fid)\fR
- .)b
- .(x z
- INVALIDATE_FID
- .)x \*($n
- sets \*(lqfid\*(rq to an invalid file identifier.
- .(b I
- \fBFID_IS_INVALID (FID fid)\fR
- .)b
- .(x z
- FID_IS_INVALID
- .)x \*($n
- returns TRUE if \*(lqfid\*(rq is the invalid identifier given by
- INVALIDATE_FID(\ ), FALSE otherwise.
- .lp
- The rest of this section describes operations on files
- and operations that concern entire files of objects.
- .sh 3 "Consistency Guarantees for Files"
- .(x z
- files, consistency guarantees for
- .)x \*($n
- .lp
- The \fIlog level\fR of a file determines what
- .(x z
- files, log level
- .)x \*($n
- level of consistency is maintained for the file
- in the event that a transaction aborts or a server crashes.
- There are two log levels for files on data volumes:
- LOG_ALL and LOG_SPACE.
- LOG_ALL indicates that consistency is maintained for user data and
- meta-data.
- LOG_SPACE indicates that only meta-data are guaranteed to be consistent.
- This means that all objects are available and that they are
- the correct size, but their contents may be corrupted.
- Files that have their log level set to LOG_SPACE are
- flushed when the transaction is committed.
- \fBData pages for
- large objects (objects that do not fit on a single disk page) may not be
- flushed, so there is no guarantee
- that the data is safely on disk until
- the server dismounts the volume.\fR
- The log level is not a permanent attribute of a file.
- When an application sets the log level for a file,
- the setting lasts until it is changed or until sm_ShutDown(\ ) is called.
- If, in a transaction, the log level for a file is changed from
- LOG_SPACE to LOG_ALL, the Storage Manager guarantees only
- that the meta-data are consistent.
- .lp
- LOG_ALL is the default log level for data files.
- .(x z
- log level, default
- .)x \*($n
- .(x z
- default log level
- .)x \*($n
- LOG_SPACE is designed to conserve
- log space and increase performance for those files whose data
- integrity is not critical.
- For example, results of a query may be stored in a
- file with its log level set to LOG_SPACE,
- since the file can be regenerated in the event of a failure.
- To conserve log space when loading a large file,
- the log level for a file may be set to LOG_SPACE.
- Once the loading transaction is committed,
- the log level should be set to LOG_ALL.
- .lp
- Files on temporary volumes can have only one log level: LOG_NONE.
- .(x z
- temporary volume
- .)x \*($n
- See
- Section 5.1.3, \fBTemporary Volumes\fR,
- for more information about temporary volumes.
- .lp
- Sm_SetLogLevel(\ ) is used to change the log level for a list
- of files:
- .sp
- .(b L
- \fBsm_SetLogLevel (logLevel, fileCount, fid)
- int logLevel; /* IN log level */
- int fileCount; /* IN number of files to set level for */
- FID fid[]; /* IN list of files to set level for */ \fR
- .)b
- .(x z
- sm_SetLogLevel(\ )
- .)x \*($n
- .lp
- The \*(lqlogLevel\*(rq argument takes
- the values LOG_SPACE and LOG_ALL.
- The \*(lqfileCount\*(rq argument indicates the size of the
- last argument, \*(lqfid[]\*(rq,
- which is a list of file identifiers of the files whose log
- levels are to be affected by this function.
- It is not an error for a file in the list already
- to have the given log level.
- .lp
- If \*(lqfileCount\*(rq is zero,
- \fBall\fR files are given \*(lqlogLevel\*(rq.
- .lp
- The volumes on which the files
- reside must be available for mounting,
- and a side effect of setting the log level
- is that the volumes are mounted.
- .lp
- Sm_SetLogLevel(\ ) has no effect on files that
- reside on temporary volumes
- .(x z
- temporary volume
- .)x \*($n
- (see Section 5.1.3, \fBTemporary Volumes\fR).
- .(b L
- \fBsm_CreateFile (groupIndex, volid, fid)
- int groupIndex; /* IN buffer group in use */
- VOLID volid; /* IN the volume in which to place the file */
- FID *fid; /* OUT the file ID of the new file */\fR
- .)b
- .(x z
- sm_CreateFile(\ )
- .)x \*($n
- .lp
- Sm_CreateFile(\ ) creates a new file on the volume indicated by \*(lqvolid\*(rq.
- The file identifier of the new file is returned in the structure
- to which \*(lqfid\*(rq points.
- The caller is responsible for allocating space for the FID.
- .sp
- .(b L
- \fBsm_DestroyFile (groupIndex, fid)
- int groupIndex; /* IN buffer group in use */
- FID *fid; /* IN the file to destroy */\fR
- .)b
- .(x z
- sm_DestroyFile(\ )
- .)x \*($n
- .lp
- Sm_DestroyFile(\ ) destroys the file identified by \*(lqfid\*(rq.
- The objects in the file are destroyed along with the file.
- Disk space is released when the transaction is committed.
- .sp
- .(b L
- \fBsm_GetFirstOid (groupIndex, fid, oid, objHdr, emptyFlag)
- int groupIndex; /* IN buffer group in use */
- FID *fid; /* IN the file */
- OID *oid; /* OUT first OID */
- OBJHDR *objHdr; /* OUT the object's header */
- BOOL *emptyFlag; /* OUT empty file flag */\fR
- .)b
- .(x z
- sm_GetFirstOid(\ )
- .)x \*($n
- .lp
- Sm_GetFirstOid(\ ) retrieves the object identifier and the
- object header of the first object in the file designated by \*(lqfid\*(rq.
- The first object is the first object on the first physical page in the file.
- If the file does not contain any objects, \*(lqemptyFlag\*(rq is set to TRUE.
- .sp
- .(b L
- \fBsm_GetLastOid (groupIndex, fid, oid, objHdr, emptyFlag)
- int groupIndex; /* IN buffer group in use */
- FID *fid; /* IN the file */
- OID *oid; /* OUT last OID */
- OBJHDR *objHdr; /* OUT the object's header */
- BOOL *emptyFlag; /* OUT empty file flag */\fR
- .)b
- .(x z
- sm_GetLastOid(\ )
- .)x \*($n
- .lp
- Sm_GetLastOid(\ ) retrieves the object identifier and
- the object header of the last object in
- the file designated by \*(lqfid\*(rq.
- The last object is the
- last object on the last physical page in the file.
- If the file does not contain any objects, \*(lqemptyFlag\*(rq is set to TRUE.
- .sp
- .(b L
- \fBsm_GetNextOid (groupIndex, baseOid, nextOid, objHdr, endMarker)
- int groupIndex; /* IN buffer group in use */
- OID *baseOid; /* IN next relative to this object */
- OID *nextOid; /* OUT OID of the next object */
- OBJHDR *objHdr; /* OUT the object's header */
- BOOL *endMarker; /* OUT end-of-file flag */\fR
- .)b
- .(x z
- sm_GetNextOid(\ )
- .)x \*($n
- .lp
- Sm_GetNextOid(\ ) retrieves the object identifier and
- the object header of the next object
- in the file relative to the object addressed by \*(lqbaseOid\*(rq.
- \*(lqEndMarker\*(rq is set to TRUE when end-of-file is reached
- (i.e., when there is no next object for sm_GetNextOid(\ ) to return).
- .lp
- The next object is that which resides \fIphysically\fR next in the file.
- There is no way to scan a file's objects in the order in
- which they were inserted in the file.
- .lp
- The preferred method for retrieving all the objects in a file is to
- use scans, described in the next sub-section.
- Scans are more efficient than using sm_GetNextOid(\ ), which is
- present for backward compatibility.
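- .lp
- Where the older calls are used anyway, the loop has the shape sketched
- below, driven by the \*(lqemptyFlag\*(rq and \*(lqendMarker\*(rq outputs.
- The sketch runs against a toy in-memory file: toy_get_first(\ ) and
- toy_get_next(\ ) are hypothetical stand-ins that only imitate the argument
- pattern of sm_GetFirstOid(\ ) and sm_GetNextOid(\ ), omitting the buffer
- group and object header arguments.

```c
#include <assert.h>

typedef struct { int slot; } ToyOid;    /* stand-in for OID */
typedef struct { int count; } ToyFile;  /* objects occupy slots 0..count-1 */

/* Imitates sm_GetFirstOid(): sets *emptyFlag when the file has no objects. */
int toy_get_first(const ToyFile *f, ToyOid *oid, int *emptyFlag)
{
    *emptyFlag = (f->count == 0);
    if (!*emptyFlag)
        oid->slot = 0;
    return 0;
}

/* Imitates sm_GetNextOid(): sets *endMarker when there is no next object. */
int toy_get_next(const ToyFile *f, const ToyOid *base, ToyOid *next,
                 int *endMarker)
{
    *endMarker = (base->slot + 1 >= f->count);
    if (!*endMarker)
        next->slot = base->slot + 1;
    return 0;
}

/* Visit every object once, in physical order; return the number visited. */
int toy_visit_all(const ToyFile *f)
{
    ToyOid cur, next;
    int empty, end, visited = 0;

    toy_get_first(f, &cur, &empty);
    if (empty)
        return 0;
    for (;;) {
        visited++;                 /* the application would process cur here */
        toy_get_next(f, &cur, &next, &end);
        if (end)
            break;
        cur = next;
    }
    return visited;
}
```

- .lp
- The empty-file check comes first because the \*(lqoid\*(rq output is not
- meaningful when \*(lqemptyFlag\*(rq is TRUE; real code would also check each
- call's return value.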
- .sp
- .(b L
- \fBsm_GetPreviousOid (groupIndex, baseOid, prevOid, objHdr, endMarker)
- int groupIndex; /* IN buffer group in use */
- OID *baseOid; /* IN previous relative to this object */
- OID *prevOid; /* OUT OID of the previous object */
- OBJHDR *objHdr; /* OUT the object's header */
- BOOL *endMarker; /* OUT start-of-file flag */\fR
- .)b
- .(x z
- sm_GetPreviousOid(\ )
- .)x \*($n
- .lp
- Sm_GetPreviousOid(\ ) retrieves the object identifier and object
- header of the previous object in the file relative to the object addressed
- by \*(lqbaseOid\*(rq.
- \*(lqEndMarker\*(rq is set to TRUE when start-of-file is reached
- (i.e., when there is no previous object for sm_GetPreviousOid(\ ) to return).
- Much like sm_GetNextOid(\ ), the previous object
- is the object that is \fIphysically\fR previous in the file.
- .br
- .sh 3 "Scanning Files"
- .lp
- The objects in a file can be visited most efficiently by scanning the
- file.
- During a \fIscan\fR, the client library
- locks the entire file so that
- while one application is using the file,
- objects cannot be inserted, deleted, or changed by another application.
- \fBThe Storage Manager does not support a single application's
- modifying a file during a scan.\fR
-
- The client library also maintains some information
- about the state of the scan and the structure of the file being
- scanned.
- The information is stored in a
- \fIscan descriptor\fR, a structure of type \fBSCANDESC\fR,
- which is meant to be treated as \fIopaque\fR by the application.
- .sp
- .(b L
- \fBsm_OpenScanWithGroup (fid, type, groupIndex, scanDesc, oid)
- FID *fid; /* IN file to scan */
- int type; /* IN type of scan -- UNUSED */
- int groupIndex; /* IN buffer group for use in scan */
- SCANDESC **scanDesc; /* OUT returned scan descriptor */
- OID *oid; /* IN optional oid to begin scan -- UNUSED */
- .)b
- .(x z
- sm_OpenScanWithGroup(\ )
- .)x \*($n
- .lp
- Sm_OpenScanWithGroup(\ ) initializes a
- scan on the file indicated by \*(lqfid\*(rq.
- A scan descriptor is passed back in \*(lqscanDesc\*(rq, for use
- in subsequent scan calls.
- Using the scan mechanism can be considerably more efficient
- than using the sm_GetNextOid(\ ) call or sm_ReadObject(\ ).
- Scans use a buffer group, \*(lqgroupIndex\*(rq.
- This group should be created
- with the most-recently-used replacement policy,
- and its size should be tuned to
- reflect the buffering requirements for the scan.
- The buffer group should have a size of at least five pages.
- .lp
- Objects are scanned in the order in which they physically reside on disk.
- After sm_OpenScanWithGroup(\ ) returns, the scan pointer is
- before the first object in the file. This is true even if
- the file is empty, in which case the first call to sm_ScanNextObject(\ )
- returns a flag indicating the end-of-file condition.
- The \*(lqtype\*(rq and \*(lqoid\*(rq arguments are not used and are
- present for backward compatibility.
- .sp
- .(b L
- \fBsm_OpenScan (fid, type, groupSize, scanDesc, oid)
- FID *fid; /* IN file to scan */
- int type; /* IN type of scan -- UNUSED */
- int groupSize; /* IN size of buffer group in pages */
- SCANDESC **scanDesc; /* OUT returned scan descriptor */
- OID *oid; /* IN optional oid to begin scan -- UNUSED */
- .)b
- .(x z
- sm_OpenScan(\ )
- .)x \*($n
- .lp
- Sm_OpenScan(\ ) is like
- sm_OpenScanWithGroup(\ ), but it is
- less flexible, and it is provided for backward compatibility.
- It is identical to sm_OpenScanWithGroup(\ )
- except that it creates a buffer group with
- the most-recently-used replacement policy and size
- \*(lqgroupSize\*(rq.
- \*(lqGroupSize\*(rq should be at least five (pages).
- The buffer group is destroyed when the scan is closed.
- .sp
- .(b L
- \fBsm_ScanNextObject (scanDesc, start, length, retDesc, eof)
- SCANDESC *scanDesc; /* IN scan descriptor */
- int start; /* IN starting offset in object */
- int length; /* IN number of bytes to read */
- USERDESC **retDesc; /* OUT descriptor to access the data */
- BOOL *eof; /* OUT end of file indicator */
- .)b
- .(x z
- sm_ScanNextObject(\ )
- .)x \*($n
- .lp
- Sm_ScanNextObject(\ ) reads the next object in the file
- and pins the object as if sm_ReadObject(\ ) were used.
- \*(lqScanDesc\*(rq is the scan descriptor returned when the scan was opened.
- \*(lqStart\*(rq is the starting offset within the object to return.
- .lp
- \*(lqLength\*(rq is the number of bytes to read.
- If \*(lqlength\*(rq is READ_ALL, the entire object is read (assuming
- that the size of the entire object is not greater than the amount of
- unpinned space in the buffer group).
- To obtain the object header and OID information for the object,
- use a \*(lqlength\*(rq of zero.
- .lp
- Sm_ScanNextObject(\ ) returns a user descriptor for the object,
- if there is one to pin, whether or not any bytes are pinned.
- \*(lqEof\*(rq is set to TRUE and
- \*(lqretDesc\*(rq is set to NULL
- when there are no more objects to be scanned.
- Each call to sm_ScanNextObject(\ ) releases the user descriptor
- returned by the previous scan call, so
- \fBsm_ReleaseObject(\ ) must not be used\fR
- .(x z
- sm_ReleaseObject(\ )
- .)x \*($n
- on user descriptors that are acquired by scanning files.
- .sp
- .(b L
- \fBsm_ScanNextBytes (scanDesc, length)
- SCANDESC *scanDesc; /* IN scan descriptor */
- int length; /* IN number of bytes to read */
- .)b
- .(x z
- sm_ScanNextBytes(\ )
- .)x \*($n
- .lp
- Sm_ScanNextBytes(\ ) is useful when a file being scanned contains very
- large objects that cannot be expected to fit in memory.
- An sm_ScanNextObject(\ ) call can be made with a relatively small length to
- read in the first section of an object.
- Sm_ScanNextBytes(\ ) is used subsequently to iterate over the rest of
- that object, with each call reading in the next \*(lqlength\*(rq bytes
- of the current scan object. The iteration can
- be controlled by observing the objectSize field of the user
- descriptor. esmENDOFOBJECT is returned if there are no more bytes
- to be read in the current object.
- .sp
- .(b L
- \fBsm_CloseScan (scanDesc)
- SCANDESC *scanDesc; /* IN scan descriptor */
- .)b
- .(x z
- sm_CloseScan(\ )
- .)x \*($n
- .lp
- Sm_CloseScan(\ ) closes the scan
- associated with \*(lqscanDesc\*(rq.
- It releases the scan descriptor
- and the user descriptors and data pinned during
- the scan.
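- .lp
- The scan calls described above are typically combined into a single loop.
- The following is a minimal sketch of that loop, not code taken from the Storage Manager:
- the demo_ functions and Demo types are hypothetical in-memory stand-ins for
- sm_OpenScan(\ ), sm_ScanNextObject(\ ), sm_CloseScan(\ ), SCANDESC, and USERDESC,
- written so that the fragment compiles without the client library.
- It assumes that end-of-file is reported through \*(lqeof\*(rq together with a
- successful return status, which this section does not state explicitly.

```c
#include <stddef.h>

#define DEMO_NOERROR 0          /* stands in for esmNOERROR (assumption) */

typedef int BOOL;

/* Stand-in for USERDESC; only the objectSize field named in the text. */
typedef struct { int objectSize; } DemoUserDesc;

/* Stand-in for SCANDESC.  The real descriptor is opaque and is allocated
 * by the client library; this one walks an array of object sizes. */
typedef struct {
    const int   *sizes;     /* the in-memory "file" */
    int          count;
    int          next;      /* index of the next object to return */
    DemoUserDesc current;   /* descriptor handed to the caller */
} DemoScanDesc;

int demo_OpenScan(const int *sizes, int count, DemoScanDesc *sd)
{
    sd->sizes = sizes;
    sd->count = count;
    sd->next  = 0;          /* scan pointer is before the first object */
    return DEMO_NOERROR;
}

int demo_ScanNextObject(DemoScanDesc *sd, int start, int length,
                        DemoUserDesc **retDesc, BOOL *eof)
{
    (void)start;
    (void)length;           /* the demo never pins any bytes */
    if (sd->next >= sd->count) {
        *retDesc = NULL;    /* no more objects: eof TRUE, descriptor NULL */
        *eof = 1;
        return DEMO_NOERROR;
    }
    /* Reusing one descriptor mimics "each call releases the previous one". */
    sd->current.objectSize = sd->sizes[sd->next++];
    *retDesc = &sd->current;
    *eof = 0;
    return DEMO_NOERROR;
}

int demo_CloseScan(DemoScanDesc *sd)
{
    sd->next = sd->count;
    return DEMO_NOERROR;
}

/* The loop an application might write: visit every object, add up the
 * object sizes, and count the objects.  Returns -1 on error. */
long total_bytes_in_file(const int *sizes, int count, int *numObjects)
{
    DemoScanDesc  sd;
    DemoUserDesc *ud;
    BOOL          eof = 0;
    long          total = 0;

    *numObjects = 0;
    if (demo_OpenScan(sizes, count, &sd) != DEMO_NOERROR)
        return -1;
    for (;;) {
        /* length 0: header information only, no object bytes pinned */
        if (demo_ScanNextObject(&sd, 0, 0, &ud, &eof) != DEMO_NOERROR) {
            total = -1;
            break;
        }
        if (eof)
            break;
        total += ud->objectSize;
        (*numObjects)++;
        /* No release call here: the next scan call releases ud. */
    }
    demo_CloseScan(&sd);
    return total;
}
```

- .lp
- An empty file needs no special case in this sketch: the first scan call sets
- \*(lqeof\*(rq and the loop body never runs.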
- .br
- .sh 3 "Bulk-loading Files"
- .lp
- \fBWARNING\fR: the file bulk load facility does not work properly in
- version \*V. We recommend that it not be used.
- .sp
- .(b L
- \fBsm_OpenLoad (fid, type, groupSize, fillFactor, loadDesc)
- FID *fid; /* IN file to load */
- int groupSize; /* IN size of load buffer group */
- float fillFactor; /* IN fill percentage */
- LOADDESC **loadDesc; /* OUT returned load descriptor */
- .)b
- .(x z
- sm_OpenLoad(\ )
- .)x \*($n
- .lp
- Sm_OpenLoad(\ ) prepares to load a set of objects into a file in bulk.
- Bulk loading a file can be more efficient than using a series of
- sm_CreateObject(\ ) calls.
- The file, indicated by \*(lqfid\*(rq, need not be empty; if it already contains objects,
- the new objects are added to the end of the file.
- The load mechanism creates and uses its own buffer group;
- the size of the buffer group is \*(lqgroupSize\*(rq.
- .\" The \*(lqfillFactor\*(rq indicates how full to fill pages with objects.
- .\" A valid value is in the range 0.00 to 1.00, inclusive; 0.00 indicates empty and 1.00 indicates full.
- The \*(lqfillFactor\*(rq argument is ignored; it is present for future extensions.
- A \fIload descriptor\fR, \*(lqloadDesc\*(rq, is
- .(x z
- load descriptor
- .)x \*($n
- returned for use in subsequent operations
- (sm_LoadNextObject(\ ) and
- sm_CloseLoad(\ )).
- .sp
- .(b L
- \fBsm_LoadNextObject (loadDesc, length, data, oid)
- LOADDESC *loadDesc; /* IN load descriptor */
- int length; /* IN length of the object */
- void *data; /* IN the object's data */
- OID *oid; /* OUT returned new object id */
- .)b
- .(x z
- sm_LoadNextObject(\ )
- .)x \*($n
- .lp
- Sm_LoadNextObject(\ ) creates a new object of size
- \*(lqlength\*(rq
- in the file for which the \*(lqloadDesc\*(rq was opened.
- The new object is initialized with \*(lqdata\*(rq.
- If \*(lqdata\*(rq is NULL, the object is filled with zeroes.
- Sm_LoadNextObject(\ ) returns an object identifier for the new object in
- \*(lqoid\*(rq.
- .sp
- .(b L
- \fBsm_CloseLoad (loadDesc)
- LOADDESC *loadDesc; /* IN load to close */\fR
- .)b
- .(x z
- sm_CloseLoad(\ )
- .)x \*($n
- .lp
- Sm_CloseLoad(\ ) ends the bulk-load operation.
- .br
- .sh 2 "Operations on Indexes"
- .(x z
- index, operations on
- .)x \*($n
- .lp
- The Storage Manager's index facility associates keys with
- fixed-length elements.
- The keys can be any basic C data type
- (SM_int, SM_long, SM_short, SM_float, SM_double) or strings (SM_string).
- The size of the element is fixed when the index is created.
- .lp
- B\*[+\*]tree
- index and linear hashing index functions are
- implemented.
- B\*[+\*]tree
- provides fast index lookup
- on all kinds of queries, especially range queries.
- Linear hashing provides even faster index lookup and supports linear space growth
- for dynamically growing indexes, but
- it supports only exact-match queries.
- More information about linear hashing can be found in [Litw88].
- .lp
- A key is fully described by the \fBKEY\fR structure:
- .sp
- .(b L
- \fBtypedef struct {
- TWO length; /* length of the key */
- void* valuePtr; /* pointer to value of the key */
- } KEY; \fR
- .)b
- .(x z
- KEY
- .)x \*($n
- .lp
- Index keys are compared according to the key type given when
- the index is created.
- The key type determines the number of bytes considered in a key
- comparison.
- In the case of keys that are strings, the length fields in the keys
- in question determine the number of bytes compared.
- Strings are compared one character at a time.
- The client library does not terminate strings with nulls.
- When two strings of different lengths are compared, the shorter
- string is compared with the corresponding substring of the longer string.
- If the shorter string and the corresponding substring are equal,
- the longer string is considered to be the larger of the two.
- This means that "abc\e0" is considered larger than "abc".
- .lp
- Characters are compared as ASCII values.
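- .lp
- The string comparison rule can be sketched in C as follows.
- This is an illustration of the rule as stated above, not the client library's code:
- compare_string_keys(\ ) is a hypothetical helper, the KEY structure is repeated
- only so that the fragment stands alone, and TWO is assumed to be a two-byte integer.
- Bytes are compared as unsigned values, which agrees with ASCII ordering for
- all ASCII characters.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the KEY structure shown above.  TWO is assumed here to be
 * a two-byte integer; this section does not define it. */
typedef short TWO;
typedef struct {
    TWO   length;    /* length of the key in bytes */
    void *valuePtr;  /* pointer to the value of the key */
} KEY;

/* Hypothetical helper, not part of the Storage Manager interface.
 * Returns <0, 0, or >0 as key a sorts before, equal to, or after key b. */
int compare_string_keys(const KEY *a, const KEY *b)
{
    const unsigned char *pa, *pb;
    int i, common;

    assert(a != NULL && b != NULL);
    pa = (const unsigned char *)a->valuePtr;
    pb = (const unsigned char *)b->valuePtr;
    common = (a->length < b->length) ? a->length : b->length;

    /* Compare the shorter key with the matching prefix of the longer,
     * one character at a time. */
    for (i = 0; i < common; i++) {
        if (pa[i] != pb[i])
            return (int)pa[i] - (int)pb[i];
    }
    /* Equal prefixes: the longer key is the larger of the two. */
    return (int)a->length - (int)b->length;
}
```

- .lp
- Because the length field, not a terminating null, delimits the key, a key of
- length 4 holding the characters a, b, c and a null byte compares greater than
- a key of length 3 holding a, b, c.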
- .sh 3 "Creating and Destroying Indexes "
- .lp
- When an index is created, the client library creates a handle,
- by which the index is identified in subsequent operations.
- The handle is an \fIindex identifier\fR, a structure of type IID.
- .(x z
- index identifier
- .)x \*($n
- The value of the index identifier can be treated as an
- opaque value by the application.
- .lp
- The following macros can be used to give
- an illegitimate initial value to an
- index identifier, and later to recognize that value:
- .(b I
- \fBINVALIDATE_IID (IID iid)\fR
- .)b
- sets \*(lqiid\*(rq to an invalid index identifier.
- .(b I
- \fBIID_IS_INVALID (IID iid)\fR
- .)b
- returns TRUE if \*(lqiid\*(rq has the value given by
- INVALIDATE_IID(\ ), FALSE if not.
- .lp
- The rest of this section describes the functions that
- .(x z
- index, operations on
- .)x \*($n
- .(x z
- index
- .)x \*($n
- operate on indexes.
- .sp
- .(b L
- \fBsm_CreateIndex(volume, groupIndex, ndxType, keyType, maxKeyLen, elSize, unique, ndx)
- VOLID volume; /* IN volume on which index is to be built */
- int groupIndex; /* IN the buffer group to use */
- SMTYPE ndxType; /* IN SM_BTREENDX, SM_HASHNDX, etc */
- SMDATATYPE keyType; /* IN SM_int, SM_long, SM_string, etc */
- int maxKeyLen; /* IN maximum key length of a key in the index */
- int elSize; /* IN element size (mpl of 4, < SM_MAXELEMLEN) */
- BOOL unique; /* IN TRUE if key is unique */
- IID* ndx; /* OUT returned index identifier */ \fR
- .)b
- .(x z
- sm_CreateIndex(\ )
- .)x \*($n
- .lp
- Sm_CreateIndex(\ ) creates an index that resides on \*(lqvolume\*(rq. \**
- .(f
- \** Indexes on temporary volumes are not implemented
- (see Section 5.1.3, \fBTemporary Volumes\fR).
- If the volume given is temporary, sm_CreateIndex(\ )
- returns esmFAILURE, with error code esmNOTIMPLEMENTED.
- .)f
- \*(lqNdxType\*(rq specifies the type of index
- (SM_BTREENDX for
- B\*[+\*]tree
- or SM_HASHNDX for linear hashing).
- \*(lqKeyType\*(rq indicates the data type of the key.
- The maximum length of a key in the index is given in \*(lqmaxKeyLen\*(rq.
- The size of the elements in the index is given in \*(lqelSize\*(rq.
- The element size must be a multiple of four and less than
- SM_MAXELEMLEN (20).
- If \*(lqunique\*(rq is FALSE,
- the index is able to store multiple elements under the same key.
- An index identifier is returned in \*(lqndx\*(rq upon successful completion.
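- .lp
- An application may wish to check the two constraints on \*(lqelSize\*(rq before
- calling sm_CreateIndex(\ ).
- A minimal sketch follows; elem_size_is_valid(\ ) and DEMO_MAXELEMLEN are hypothetical
- names, the value 20 is the one quoted above for SM_MAXELEMLEN, and rejecting
- zero and negative sizes is an assumption that this section does not state.

```c
/* Value quoted in the text for SM_MAXELEMLEN; real code would use the
 * constant from the client library's header instead. */
#define DEMO_MAXELEMLEN 20

/* Hypothetical helper: nonzero if elSize is a positive multiple of four
 * that is strictly less than the maximum element length. */
int elem_size_is_valid(int elSize)
{
    return elSize > 0 && elSize % 4 == 0 && elSize < DEMO_MAXELEMLEN;
}
```

- .lp
- Under these assumptions the acceptable element sizes are 4, 8, 12, and 16 bytes.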
- .sp
- .(b L
- \fBsm_DestroyIndex(ndx, groupIndex)
- IID* ndx; /* IN id of index to destroy */
- int groupIndex; /* IN which buffer group to use */\fR
- .)b
- .(x z
- sm_DestroyIndex(\ )
- .)x \*($n
- .lp
- Sm_DestroyIndex(\ ) destroys the index associated with \*(lqndx\*(rq.
- .sp
- .(b L
- \fBsm_SetLHashLoadThreshold(ndx, groupIndex, loadFactor)
- IID* ndx; /* IN index identifier */
- int groupIndex; /* IN which buffer group to use */
- float loadFactor; /* IN the load factor to use for linear hashing */\fR
- .)b
- .(x z
- sm_SetLHashLoadThreshold(\ )
- .)x \*($n
- .lp
- Sm_SetLHashLoadThreshold(\ )
- changes the load factor for a linear hashing index from the
- default 75% to the given \*(lqloadFactor\*(rq.
- .(x z
- load factor, default for linear hashing indexes
- .)x \*($n
- .(x z
- default load factor
- .)x \*($n
- The default load factor, 75%, yields the best
- access time and space utilization.
- See [Litw88] for information about linear hashing and
- when it might be useful to change the load factor.
- The load factor can be set only on a newly created index.
- .br
- .sh 3 "Inserting and Removing Index Elements "
- .sp
- .(b L
- \fBsm_InsertEntry(ndx, groupIndex, key, elem)
- IID* ndx; /* IN index identifier */
- int groupIndex; /* IN which buffer group to use */
- KEY* key; /* IN key to insert */
- void* elem; /* IN element associated with key */ \fR
- .)b
- .(x z
- sm_InsertEntry(\ )
- .)x \*($n
- .lp
- Sm_InsertEntry(\ ) inserts a <key, elem> pair into the index \*(lqndx\*(rq.
- If \*(lqndx\*(rq is a unique index and the key to be inserted already appears
- in the index, sm_InsertEntry(\ ) returns an error in sm_errno.
- If the index is not unique, there is no limit to the number
- of duplicate keys as long as different elements are associated with them.
- .sp
- .(b L
- \fBsm_RemoveEntry(ndx, groupIndex, key, elem)
- IID* ndx; /* IN index identifier */
- int groupIndex; /* IN which buffer group to use */
- KEY* key; /* IN key to remove */
- void* elem; /* IN element associated with key */ \fR
- .)b
- .(x z
- sm_RemoveEntry(\ )
- .)x \*($n
- .lp
- Sm_RemoveEntry(\ ) removes a <key, elem> pair from the index \*(lqndx\*(rq.
- .br
- .sh 3 "Loading Indexes in Bulk"
- .lp
- The Storage Manager provides a bulk-load facility for
- efficiently loading an empty index.
- When the application begins a bulk-load operation,
- the client library
- allocates a temporary run buffer, which is used for sorting runs.
- Henceforth, the application uses sm_InsertEntry(\ )
- repeatedly to load elements into the index; no other
- index operations are allowed during a bulk-load.
- Each sm_InsertEntry(\ ) operation
- for the index inserts a <key, elem> pair into the
- temporary run buffer.
- The run buffer is sorted and written to the work file as
- a \*(lqsorted run\*(rq when it is full.
- When the application terminates the bulk-load operation,
- the client library
- merges the sorted runs into a sorted stream, from
- which the index is built from the bottom up.
- .lp
- Entries cannot be removed during a bulk-load operation.
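- .lp
- The run-and-merge scheme described above can be sketched as follows.
- This is a simplified in-memory illustration, not the client library's implementation:
- the Loader type, the load_ functions, and the constants are hypothetical,
- keys are plain integers, an array stands in for the work file, and the sketch
- stops at the sorted stream rather than building an index from it.

```c
#include <stdlib.h>
#include <string.h>

#define RUN_CAP   4                     /* entries per run buffer (hypothetical) */
#define MAX_ITEMS 64                    /* capacity of the in-memory "work file" */
#define MAX_RUNS  (MAX_ITEMS / RUN_CAP)

typedef struct {
    int buf[RUN_CAP];                   /* run buffer being filled */
    int nbuf;
    int work[MAX_ITEMS];                /* sorted runs, stored back to back */
    int nwork;
    int runStart[MAX_RUNS + 1];         /* where each run begins in work[] */
    int nruns;
} Loader;

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

void load_begin(Loader *l)
{
    l->nbuf = l->nwork = l->nruns = 0;
}

/* Sort the run buffer and append it to the work area as one sorted run. */
static void flush_run(Loader *l)
{
    if (l->nbuf == 0)
        return;
    qsort(l->buf, (size_t)l->nbuf, sizeof l->buf[0], cmp_int);
    l->runStart[l->nruns++] = l->nwork;
    memcpy(l->work + l->nwork, l->buf, (size_t)l->nbuf * sizeof l->buf[0]);
    l->nwork += l->nbuf;
    l->nbuf = 0;
}

/* Returns 0 on success, -1 if the demo's fixed capacity is exhausted. */
int load_insert(Loader *l, int key)
{
    if (l->nwork + l->nbuf >= MAX_ITEMS)
        return -1;
    l->buf[l->nbuf++] = key;
    if (l->nbuf == RUN_CAP)
        flush_run(l);                   /* buffer full: write a sorted run */
    return 0;
}

/* Flush the last partial run, then merge all runs into out[], which must
 * hold MAX_ITEMS ints.  Returns the number of keys written. */
int load_end(Loader *l, int *out)
{
    int pos[MAX_RUNS], r, n = 0;

    flush_run(l);
    l->runStart[l->nruns] = l->nwork;   /* sentinel: end of the last run */
    for (r = 0; r < l->nruns; r++)
        pos[r] = l->runStart[r];
    for (;;) {
        int best = -1;
        /* Pick the run whose next unread key is smallest. */
        for (r = 0; r < l->nruns; r++) {
            if (pos[r] < l->runStart[r + 1] &&
                (best < 0 || l->work[pos[r]] < l->work[pos[best]]))
                best = r;
        }
        if (best < 0)
            break;                      /* every run is exhausted */
        out[n++] = l->work[pos[best]++];
    }
    return n;
}
```

- .lp
- In this sketch a larger RUN_CAP plays the role of a larger \*(lqrunSize\*(rq:
- more memory is used for the run buffer, and fewer runs have to be merged at the end.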
- .sp 2
- .(b L
- \fBint sm_BeginIndexLoad(ndx, groupIndex, workVolume, runSize)
- IID* ndx; /* IN index identifier */
- int groupIndex; /* IN the buffer group to use */
- VOLID workVolume; /* IN work volume */
- int runSize; /* IN size of each sorted run in pages */ \fR
- .)b
- Sm_BeginIndexLoad(\ ) prepares to load the index given in \*(lqndx\*(rq,
- using the buffer group \*(lqgroupIndex\*(rq.
- Sm_BeginIndexLoad(\ ) uses the volume named by
- \*(lqworkVolume\*(rq for the sorted runs.
- Using a temporary volume for the work volume yields
- .(x z
- temporary volume
- .)x \*($n
- the best performance
- (see Section 5.1.3, \fBTemporary Volumes\fR).
- .lp
- The \*(lqrunSize\*(rq argument determines how many MIN_PAGESIZE pages
- to fill before ending a run.
- The larger \*(lqrunSize\*(rq, the
- more memory is consumed by the bulk-load, with a commensurate
- improvement in speed.
- Sm_BeginIndexLoad(\ ), if it is used, must be the first operation performed on
- an index.
- .sp 2
- .(b L
- \fBint sm_EndIndexLoad(ndx)
- IID* ndx; /* IN index identifier */ \fR
- .)b
- Sm_EndIndexLoad(\ ) concludes the bulk-load and builds the index.
- .sp
- .(b L 2
- \fBint sm_AbortIndexLoad(ndx)
- IID* ndx; /* IN index identifier */ \fR
- .)b
- Sm_AbortIndexLoad(\ ) aborts the bulk-loading of an index.
- All resources used by the index are freed.
- .br
- .sh 3 "Scanning Indexes"
- .lp
- Indexes are used by posing queries with the sm_FetchInit(\ ) operation.
- A query requests all the elements whose key values lie in a range.
- The results of the query are fetched, one element at a time, with the
- sm_FetchNext(\ ) operation.
- An index scan uses a \fIcursor\fR, a value of the type SMCURSOR.
- .(x z
- cursor
- .)x \*($n
- A cursor can be treated by the application as an opaque value.
- The following two macros give a cursor an invalid
- initial value and recognize that value:
- .(b I
- \fBINVALIDATE_CURSOR (SMCURSOR cursor)\fR
- .)b
- sets \*(lqcursor\*(rq to an invalid index scan cursor.
- .(b I
- \fBCURSOR_IS_INVALID (SMCURSOR cursor)\fR
- .)b
- returns TRUE if \*(lqcursor\*(rq is the value given
- by INVALIDATE_CURSOR(\ ), FALSE if not.
- .lp
- The rest of this section describes the functions used to scan
- indexes.
- .sp
- .(b L
- \fBsm_FetchInit(ndx, groupIndex, bound1, cond1, bound2, cond2, cursor)
- IID* ndx; /* IN index identifier */
- int groupIndex; /* IN which buffer group to use */
- KEY* bound1; /* IN starting bound of the scan */
- SMCOND cond1; /* IN starting condition */
- KEY* bound2; /* IN ending bound of the scan */
- SMCOND cond2; /* IN ending condition */
- SMCURSOR* cursor; /* OUT returned pointer if non-NULL */\fR
- .)b
- .(x z
- scanning an index
- .)x \*($n
- .(x z
- index scan
- .)x \*($n
- .(x z
- sm_FetchInit(\ )
- .)x \*($n
- .lp
- Sm_FetchInit(\ ) begins a scan on the index \*(lqndx\*(rq.
- The arguments \*(lqbound1\*(rq and \*(lqcond1\*(rq specify the
- beginning search condition.
- \*(lqBound2\*(rq and \*(lqcond2\*(rq specify the ending search condition.
- The conditions can be SM_EQ, SM_G, SM_L, SM_GEQ, or SM_LEQ.
- The \*(lqcursor\*(rq argument is initialized by
- sm_FetchInit(\ ) and used by sm_FetchNext(\ ).
- The caller is responsible for allocating the space for the cursor
- and the client library is responsible for the value of the cursor.
- .sp
- The direction of the scan (ascending or descending) is
- determined by the bounds and conditions of the query.
- The beginning and end of an index are specified with the
- macros SM_BOF and SM_EOF.
- For linear hashing indexes (type SM_HASHNDX), the only value that
- .(x z
- index query
- .)x \*($n
- .(x z
- query, index
- .)x \*($n
- can be used for \*(lqcond1\*(rq and \*(lqcond2\*(rq is SM_EQ.
- .sp
- Several examples of queries follow:
- .np
- Scan from key1 = \*(lq10\*(rq to key2 = \*(lq30\*(rq inclusively:
- .br
- sm_FetchInit( ..., key1, SM_GEQ, key2, SM_LEQ, cursor) --- ascending
- .br
- sm_FetchInit( ..., key2, SM_LEQ, key1, SM_GEQ, cursor) --- descending
- .sp
- .np
- Scan from key1 = \*(lq10\*(rq to the end of the index:
- .br
- sm_FetchInit( ..., key1, SM_GEQ, SM_EOF, cursor) --- ascending
- .br
- sm_FetchInit( ..., SM_EOF, key1, SM_GEQ, cursor) --- descending
- .sp
- .np
- Scan the whole index:
- .br
- sm_FetchInit( ..., SM_BOF, SM_EOF, cursor) --- ascending
- .br
- sm_FetchInit( ..., SM_EOF, SM_BOF, cursor) --- descending
- .lp
- .sp 2
- .(b L
- \fBsm_FetchNext(cursor, retKey, retElem, eof)
- SMCURSOR* cursor; /* IN cursor from sm_FetchInit(\ ) */
- KEY* retKey; /* OUT returned key (optional) */
- void* retElem; /* OUT elem */
- BOOL* eof; /* OUT set to TRUE if EOF reached */\fR
- .)b
- .(x z
- sm_FetchNext(\ )
- .)x \*($n
- .lp
- Sm_FetchNext(\ ) fetches the next element returned by a query.
- The element is returned in the structure addressed by \*(lqretElem\*(rq.
- A copy of the key can also be returned to the caller.
- If \*(lqretKey\*(rq is NULL, no key is returned.
- If \*(lqretKey\*(rq points to a key structure, the key is returned in that
- structure.
- The \*(lqlength\*(rq field in the key structure must indicate
- the amount of space available in the
- target of the \*(lqvaluePtr\*(rq field.
- This must be enough for the longest key in the index.
- The caller is responsible for allocating space for \*(lqretKey\*(rq and \*(lqretElem\*(rq.
- .lp
- Sm_FetchNext(\ ) returns FALSE in \*(lqeof\*(rq if an element is returned.
- If there are no more elements that satisfy the query,
- TRUE is returned in \*(lqeof\*(rq.
- .sp
- .br
- .sh 2 "Advanced Topics"
- .sh 3 "External Two-Phase Commit Functions"
- .(x z
- two-phase commit functions, external
- .)x \*($n
- .(x z
- two-phase commit protocol, external
- .)x \*($n
- .(x z
- transactions, distributed
- .)x \*($n
- .(x z
- distributed transactions
- .)x \*($n
- .lp
- The Storage Manager can participate in transactions coordinated by
- other software modules that employ the
- two-phase commit \*(lqpresumed abort\*(rq transaction semantics and protocol.
- (For the purpose of this section, the reader is assumed to be familiar
- with the \*(lqpresumed abort\*(rq protocol.)
- The coordinator in such a situation is external to the Storage Manager;
- it is assumed to have its own stable storage, and it is assumed
- to recover from failures in a \fIshort time\fR (the precise meaning
- of which is given below).
- .lp
- A prepared transaction, like an active transaction, consumes log space on
- one or more Exodus servers,
- beginning at a fixed location in each log.
- A Storage Manager server's log is like a circular buffer; it
- wraps and reuses the beginning of the log.
- If long-running or prepared transactions are still in the system,
- the server eventually tries to re-use log space
- consumed by the oldest transaction, at which point it
- effectively runs out of log space.
- A coordinator must resolve its prepared transactions before the servers
- run out of log space. The amount of time involved is a function of
- the size of the log on the participating servers and the load on those
- servers.
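- .lp
- The way an unresolved transaction pins log space can be illustrated with a toy model.
- The sketch below is not the server's log manager; the LogModel type and the
- log_ functions are hypothetical, and the model ignores checkpoints and
- everything else that affects how much of a real log can be reused.

```c
/* Toy model of a circular log.  Positions are ever-increasing byte counts;
 * the physical offset of a position would be position % capacity. */
typedef struct {
    long capacity;     /* size of the log in bytes */
    long head;         /* position at which the next record is written */
    long oldestStart;  /* first record of the oldest unresolved transaction,
                          or equal to head when none is outstanding */
} LogModel;

/* Space that can be written before the head would wrap onto records still
 * needed by the oldest unresolved transaction. */
long log_free_space(const LogModel *lg)
{
    return lg->capacity - (lg->head - lg->oldestStart);
}

/* Returns 0 if nbytes were appended, -1 if the log is effectively full. */
int log_append(LogModel *lg, long nbytes)
{
    if (nbytes > log_free_space(lg))
        return -1;
    lg->head += nbytes;
    return 0;
}

/* Committing or aborting the oldest transaction lets the reusable region
 * advance to the start of the next-oldest transaction. */
void log_resolve_oldest(LogModel *lg, long nextOldestStart)
{
    lg->oldestStart = nextOldestStart;
}
```

- .lp
- In this model a prepared transaction that is never resolved keeps oldestStart
- fixed, so the free space only shrinks as other work is logged, however large
- the log is.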
- .lp
- For the purpose of this discussion,
- the portion of a global transaction that involves a single
- Exodus Storage Manager transaction is called a \fIthread\fR
- .(x z
- thread
- .)x \*($n
- of the global transaction.
- Each thread has, in addition to its
- local transaction identifier, a global transaction identifier.
- Global transaction identifiers are provided by the application or
- some external authority, and must be unique.
- A global transaction identifier has type GTID, defined in
- \fCsm_client.h\fR, as follows:
- .(b I
- \fB#define MAXOPAQUELEN 255
- typedef struct {
- int length; /* maximum MAXOPAQUELEN bytes */
- u_char opaque[MAXOPAQUELEN];
- } GTID;\fR
- .)b
- .(x z
- transaction identifier, global
- .)x \*($n
- .(x z
- global transaction identifier
- .)x \*($n
- .(x z
- GTID
- .)x \*($n
- .lp
- The Storage Manager does not interpret the contents of the opaque
- part of the global transaction identifier.
- .lp
- An application that invokes the external two-phase commit protocol can find
- itself in any of the transaction states mentioned in
- Section 4.3.2 (\*(lqTransaction States\*(rq).
- It can also find itself in the PREPARED state
- after a call to sm_PrepareTransaction(\ ).
- An application in PREPARED state calls
- sm_CommitTransaction(\ ) or sm_AbortTransaction(\ )
- to complete the transaction and return to the INACTIVE state.
- .lp
- While the coordinator for a global transaction is external to
- the Storage Manager, a single Storage Manager server corresponds with
- the client library and coordinates
- the Storage Manager servers that participate in the thread.
- If the application should crash during a two-phase commit,
- a new application program (representing the global coordinator)
- must run, and it must contact the Storage Manager server that is acting
- as the thread's coordinator.
- In order to locate the proper server, a
- two-phase commit process begins by
- informing the client library that a transaction is a thread
- of a global transaction, and by identifying the thread's coordinator.
- The function sm_Enter2PC(\ ), described below, accomplishes this.
- .sp
- .(b L
- \fBsm_Enter2PC (tid, gtid, handle)
- TID tid; /* IN transaction ID */
- GTID *gtid; /* IN global transaction ID */
- COORD_HANDLE *handle; /* OUT for use if client crashes */
- \fR
- .)b
- .(x z
- sm_Enter2PC(\ )
- .)x \*($n
- .lp
- The application supplies the local and global transaction identifiers.
- The client library identifies a thread coordinator, and produces
- a handle for the application to write to stable storage.
- The handle identifies the thread coordinator; it is used
- by sm_Recover2PC(\ ) if the client crashes before the two-phase
- commit is completed.
- .lp
- The handle must be written to stable storage before the
- first phase of the commit begins,
- otherwise the application and Storage Manager
- may not be able to recover from a subsequent application failure.
- .sp
- .(b L
- \fBsm_PrepareTransaction (tid, vote)
- TID tid; /* IN transaction ID */
- VOTE *vote; /* OUT result of first phase */
- \fR
- .)b
- .(x z
- sm_PrepareTransaction(\ )
- .)x \*($n
- .lp
- The application calls sm_PrepareTransaction(\ ) to begin the first, or
- prepare, phase of a two-phase commit.
- Sm_PrepareTransaction(\ ) determines if
- the participating servers are able to commit the transaction,
- and directs them to prepare to commit if they are.
- If any of the participating servers is unable to commit the
- transaction, the vote returned is NOVOTE.
- In that case sm_PrepareTransaction(\ ) sets sm_errno to esmTRANSABORTED
- and sm_reason to esmTRANSNOTPREPARED,
- and returns esmFAILURE;
- the application must then call sm_AbortTransaction(\ ).
- .lp
- If all participating servers are able to commit, and any of them
- logged updates during the transaction, the vote is YESVOTE,
- and
- the transaction state becomes PREPARED.
- If the transaction did not update any data on any of the servers,
- the vote is READVOTE, and the transaction state becomes INACTIVE.
- Sm_PrepareTransaction(\ ) returns esmNOERROR if the transaction
- becomes prepared (the vote is YESVOTE) or committed
- (the vote is READVOTE).
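- .lp
- The rules for the vote and the resulting transaction state can be summarized in code.
- The sketch below restates the rules given above and is not the client library's logic;
- the Demo types, their values, and both functions are hypothetical, and the real
- VOTE type declared in the client library's header may be defined differently.

```c
/* Hypothetical stand-ins for the vote values and the states named in
 * the text; the real definitions may differ. */
typedef enum { DEMO_NOVOTE, DEMO_YESVOTE, DEMO_READVOTE } DemoVote;
typedef enum { DEMO_MUST_ABORT, DEMO_PREPARED, DEMO_INACTIVE } DemoOutcome;

typedef struct {
    int canCommit;      /* nonzero if this server is able to commit */
    int loggedUpdates;  /* nonzero if it logged updates in the transaction */
} DemoServer;

/* Combine the participating servers' answers into a single vote. */
DemoVote combine_votes(const DemoServer *servers, int n)
{
    int i, anyUpdates = 0;

    for (i = 0; i < n; i++) {
        if (!servers[i].canCommit)
            return DEMO_NOVOTE;         /* one refusal decides the vote */
        if (servers[i].loggedUpdates)
            anyUpdates = 1;
    }
    return anyUpdates ? DEMO_YESVOTE : DEMO_READVOTE;
}

/* What the application is left with after the prepare phase. */
DemoOutcome outcome_after_prepare(DemoVote vote)
{
    switch (vote) {
    case DEMO_YESVOTE:
        return DEMO_PREPARED;    /* second phase still required */
    case DEMO_READVOTE:
        return DEMO_INACTIVE;    /* read-only: nothing left to resolve */
    default:
        return DEMO_MUST_ABORT;  /* application calls sm_AbortTransaction() */
    }
}
```

- .lp
- Only the DEMO_PREPARED outcome leaves log space pinned on the servers until
- the application commits or aborts the transaction.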
- .lp
- If an error occurs during the prepare phase,
- sm_PrepareTransaction(\ ) returns esmFAILURE.
- If it is a recoverable error,
- the client library returns an error code specific to the error
- in sm_errno (such as esmTRANSDISABLED if a server is performing recovery),
- and the application
- can try again to call sm_PrepareTransaction(\ ).
- Some errors, on the other hand, cause the transaction to be aborted,
- in which case sm_PrepareTransaction(\ ) returns esmTRANSABORTED in
- sm_errno, and a vote of NOVOTE.
- .(x z
- vote, two-phase commit
- .)x \*($n
- .(x z
- esmTRANSABORTED
- .)x \*($n
- .lp
- If an application crashes during the first phase, the application
- must retry the prepare phase and complete the transaction.
- If it does not retry the prepare phase, and
- the transaction was indeed prepared before the
- application crashed,
- the prepared transaction consumes resources indefinitely,
- and eventually its servers will run out of log space.
- .lp
- Once a transaction is prepared, an application must invoke the
- second phase by aborting or committing
- the transaction
- (calling sm_AbortTransaction(\ ) or sm_CommitTransaction(\ ), respectively).
- It is an error to commit a global transaction thread
- without first preparing the transaction, and it is an error
- to do anything else without completing the second phase.
- .lp
- When an error occurs during the second phase, the application cannot
- tell if the second phase completed (the transaction indeed committed
- or aborted).
- It is always safe to try again to complete the transaction
- by calling sm_AbortTransaction(\ ) or sm_CommitTransaction(\ ) again.
- .lp
- If the second phase fails because the network connection between
- the client and the thread coordinator breaks
- (esmSERVERDIED or esmNOTCONNECTED), the
- client must reconnect to the thread coordinator before the second
- phase can be finished.
- The following function does that:
- .sp
- .(b L
- \fBsm_Continue2PC (tid, willing2block)
- TID tid; /* IN transaction ID */
- BOOL willing2block; /* IN ok to block indefinitely */\fR
- .)b
- .(x z
- sm_Continue2PC(\ )
- .)x \*($n
- .lp
- If \*(lqwilling2block\*(rq is TRUE,
- the client library blocks until it connects to the thread
- coordinator.
- If this is inappropriate for the application, \*(lqwilling2block\*(rq
- must be FALSE, and the client library tries once to contact
- the thread coordinator.
- .lp
- If the application crashes, its replacement
- must use sm_Recover2PC(\ ), below, instead of sm_Continue2PC(\ ) to
- resolve the transaction.
- .sp
- .(b L
- \fBsm_Recover2PC (gtid, handle, willing2block, tid)
- GTID *gtid; /* IN global transaction ID */
- COORD_HANDLE *handle; /* IN handle for thread coordinator */
- BOOL willing2block; /* IN ok to block indefinitely */
- TID *tid; /* OUT local transaction ID */\fR
- .)b
- .(x z
- sm_Recover2PC(\ )
- .)x \*($n
- .lp
- When the application crashes (exits) after a transaction
- is prepared but before its second phase is completed,
- a \*(lqrecovery\*(rq application program must be run within a short time
- to finish the two-phase commit and resolve the transaction.
- This recovery application must use sm_Recover2PC(\ ), supplying
- the global transaction identifier and
- the handle returned by sm_Enter2PC(\ ) for that global transaction.
- The client library contacts the server identified in the handle,
- which conveys to the client library all that is needed for the
- application to enter or to retry the second phase.
- The transaction's local transaction identifier is
- returned by sm_Recover2PC(\ )
- for the application to use in its subsequent call to
- sm_CommitTransaction(\ ) or sm_AbortTransaction(\ ).
- .lp
- The thread coordinator may not be available, in which case the
- client library either keeps trying to connect or
- returns an error (such as ECONNREFUSED), depending on the value
- of \*(lqwilling2block\*(rq.
- If \*(lqwilling2block\*(rq is FALSE, the client library tries only once
- to connect to the thread coordinator.
- .br
- .sh 3 "Administrative Operations"
- .lp
- The following functions can be applied to one or more
- servers.
- Each function takes two arguments that determine which
- servers are of interest.
- The first argument is of type FLAGS, and takes one of the
- following values:
- .sp
- .(b
- \fRVOL_ALL /* the servers for all volumes */
- VOL_USED_SINCE_INIT /* servers for all volumes used */
- VOL_USED_IN_TRANSACTION /* servers used in this transaction */
- VOL_BY_VOLID /* the second argument applies */
- .)b
- The client library keeps a list of volumes and the
- servers that manage those volumes.
- The list is created from the information given in the
- configuration files and information passed to the library
- .(x z
- configuration files
- .)x \*($n
- through sm_SetClientOption(\ ).
- The flag VOL_ALL
- directs the client library to apply
- the administrative operation to
- the server that manages each volume in its list of known volumes.
- The flag VOL_USED_SINCE_INIT
- directs the client library to apply the administrative operation to
- each server contacted since sm_Initialize(\ ) was called.
- The flag VOL_USED_IN_TRANSACTION
- directs the client library to apply the administrative operation to
- each server contacted so far for participation in the current transaction.
- (It does not apply to servers to be contacted for the
- first time later in the transaction.)
- The flag VOL_BY_VOLID
- directs the client library to apply the administrative operation to
- the server that manages the volume identified by the
- second argument.
- The second argument is a volume identifier VOLID, which is ignored
- when the flags argument is VOL_ALL, VOL_USED_SINCE_INIT, or
- VOL_USED_IN_TRANSACTION.
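- .lp
- The selection rules above can be modeled in a few lines of C.
- This is only an illustrative sketch, not Storage Manager code:
- the enumeration values, the ModelVolume structure, and both functions
- are hypothetical, and the model counts volumes although the real
- library contacts the server that manages each selected volume.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: the real FLAGS values and the VOLID type come
 * from the Storage Manager headers and may differ. */
typedef enum {
    VOL_ALL,
    VOL_USED_SINCE_INIT,
    VOL_USED_IN_TRANSACTION,
    VOL_BY_VOLID
} ModelFlags;

/* One entry in a model of the client library's list of known volumes. */
typedef struct {
    int volid;          /* volume identifier */
    int usedSinceInit;  /* server contacted since sm_Initialize() */
    int usedInTrans;    /* server contacted in the current transaction */
} ModelVolume;

/* Returns nonzero if the server managing *vol would be selected.
 * The volid argument matters only for VOL_BY_VOLID. */
int model_selects(ModelFlags flags, int volid, const ModelVolume *vol)
{
    assert(vol != NULL);
    switch (flags) {
    case VOL_ALL:                 return 1;
    case VOL_USED_SINCE_INIT:     return vol->usedSinceInit;
    case VOL_USED_IN_TRANSACTION: return vol->usedInTrans;
    case VOL_BY_VOLID:            return vol->volid == volid;
    }
    return 0;
}

/* Counts the selected entries in a list of n volumes. */
size_t model_count_selected(ModelFlags flags, int volid,
                            const ModelVolume *vols, size_t n)
{
    size_t i, count = 0;
    for (i = 0; i < n; i++)
        if (model_selects(flags, volid, &vols[i]))
            count++;
    return count;
}
```

- In this model, a list holding volumes 8000 (used in the current
- transaction), 4000 (used since initialization only), and 9000 (never used)
- yields three, two, and one selected volumes for the first three flags.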
- .lp
- Ideally the administrative operations would
- only be performed by trusted clients,
- but the Storage Manager does not restrict their use.
- .sp
- .(b L
- \fBsm_TakeCheckpoint (flags, volid, numCheckpoints)
- FLAGS flags; /* IN which servers are of interest */
- VOLID volid; /* IN which server is of interest */
- short numCheckpoints; /* IN number of checkpoints to take */\fR
- .)b
- .(x z
- sm_TakeCheckpoint(\ )
- .)x \*($n
- .lp
- Sm_TakeCheckpoint(\ ) sends a request to the server to take a
- number of checkpoints.
- In most circumstances, a value of
- one for the \*(lqnumCheckpoints\*(rq argument is appropriate.
- A value greater than 1 can be used to ensure
- that the server flushes all pages that were dirty when the first
- checkpoint was taken.
- (This is useful for experimenting with the recovery facility.)
- .sp
- .(b L
- \fBsm_ChangeCheckpointFrequency (flags, volid, frequency)
- FLAGS flags; /* IN which servers are of interest */
- VOLID volid; /* IN which server is of interest */
- int frequency; /* IN number of log pages between checkpoints */\fR
- .)b
- .(x z
- sm_ChangeCheckpointFrequency(\ )
- .)x \*($n
- .lp
- Sm_ChangeCheckpointFrequency(\ ) changes
- the frequency of checkpoints taken by the server.
- The checkpoint frequency is based on the
- number of log pages written.
- .(x z
- checkpoint frequency, changing
- .)x \*($n
- .(x z
- default checkpoint frequency
- .)x \*($n
- More information about checkpoint frequency can be found in
- Section 5.3, \fBTuning the Server\fR.
- .sp
- .(b L
- \fBsm_ShutdownServer (flags, volid, options)
- FLAGS flags; /* IN which servers are of interest */
- VOLID volid; /* IN which server is of interest */
- FLAGS options; /* IN shutdown options */\fR
- .)b
- .(x z
- sm_ShutdownServer(\ )
- .)x \*($n
- .lp
- Sm_ShutdownServer(\ ) directs servers to shut down.
- The \*(lqoptions\*(rq argument indicates what a server should do before exiting.
- The following flags are available: NOFLAGS,
- SHUT_TAKE_CHECKPOINT, SHUT_DUMP_CORE, SHUT_ABORT_TRANS,
- SHUT_COMMIT_TRANS, SHUT_CLEAN_VOLUMES.
- These can be combined with the logical \*(lqor\*(rq operator.
- .lp
- If NOFLAGS is given, the server
- kills the disk processes and exits.
- .lp
- SHUT_TAKE_CHECKPOINT directs the server to take a checkpoint before exiting.
- .lp
- SHUT_DUMP_CORE directs the server to dump a core file for debugging (see core(5)).
- .lp
- SHUT_COMMIT_TRANS directs the server to wait until the
- running transactions
- either commit or abort before it shuts down.
- .lp
- SHUT_ABORT_TRANS directs the server to abort all running transactions
- before shutting down.
- When SHUT_COMMIT_TRANS or SHUT_ABORT_TRANS is used, clients
- cannot start any new transactions.
- .lp
- SHUT_CLEAN_VOLUMES directs the server to
- write dirty pages to disk before exiting.
- To shut down a server after which recovery is not required,
- use either
- SHUT_COMMIT_TRANS | SHUT_CLEAN_VOLUMES or
- SHUT_ABORT_TRANS | SHUT_CLEAN_VOLUMES.
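- .lp
- The rule for a shutdown that needs no recovery can be written as a
- small predicate.
- This is a sketch under stated assumptions: the bit values and the
- FLAGS representation below are invented for illustration (the real
- definitions come from the Storage Manager headers), and
- shutdown_needs_no_recovery(\ ) is a hypothetical helper, not a library call.

```c
#include <assert.h>

/* Hypothetical bit values, for illustration only. */
#define NOFLAGS              0x00u
#define SHUT_TAKE_CHECKPOINT 0x01u
#define SHUT_DUMP_CORE       0x02u
#define SHUT_ABORT_TRANS     0x04u
#define SHUT_COMMIT_TRANS    0x08u
#define SHUT_CLEAN_VOLUMES   0x10u
#define KNOWN_SHUTDOWN_BITS  0x1Fu

typedef unsigned int FLAGS;  /* assumed representation */

/* Returns nonzero if the options describe a shutdown after which
 * recovery is not required: running transactions are ended (committed
 * or aborted) and dirty pages are written to disk. */
int shutdown_needs_no_recovery(FLAGS options)
{
    int transactionsEnded, volumesClean;

    assert((options & ~KNOWN_SHUTDOWN_BITS) == 0);
    transactionsEnded =
        (options & (SHUT_COMMIT_TRANS | SHUT_ABORT_TRANS)) != 0;
    volumesClean = (options & SHUT_CLEAN_VOLUMES) != 0;
    return transactionsEnded && volumesClean;
}
```

- With these definitions, SHUT_COMMIT_TRANS | SHUT_CLEAN_VOLUMES satisfies
- the predicate, while NOFLAGS and SHUT_CLEAN_VOLUMES alone do not.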
- .sp
- .(b L
- \fBsm_ServerStatistics (flags, volid, numServers, stats, reset)
- FLAGS flags; /* IN which servers are of interest */
- VOLID volid; /* IN which server is of interest */
- int *numServers; /* OUT # servers contacted */
- SERVERSTATS **stats; /* OUT servers' statistics */
- BOOL reset; /* IN TRUE = reinitialize counters */\fR
- .)b
- .(x z
- sm_ServerStatistics(\ )
- .)x \*($n
- .lp
- Sm_ServerStatistics(\ )
- obtains statistics about one or more servers.
- For each server contacted, a set of statistics is returned.
- The client library allocates space for the statistics, and the
- \fBapplication is responsible for freeing that space\fR
- (see the manual page for malloc(3)).
- The \*(lqflags\*(rq indicate which servers are of interest,
- and the number of servers contacted is returned in
- \*(lq*numServers\*(rq.
- On return from sm_ServerStatistics(\ ),
- the \*(lq*stats\*(rq pointer addresses an array of
- \*(lq*numServers\*(rq SERVERSTATS structures.
- This array must be freed by the application with one call to
- \fIfree(3)\fR.
- .lp
- If \*(lqreset\*(rq is TRUE, the statistics labeled as counters below are
- reset to zero.
- .lp
- The SERVERSTATS structure looks like this:
- .(b I
- \fBtypedef struct {
- int numClients; /* # clients connected */
- int numTrans; /* # transactions in progress */
- int numVolumes; /* # volumes mounted */
- int freeLogSpace; /* approximate # bytes free log space */
- int chpntFreq; /* checkpoint frequency */
- int totalCommits; /* # transactions committed */
- int totalAborts; /* # transactions aborted */
- int diskReads; /* # disk reads */
- int diskWrites; /* # disk writes */
- MESSAGESTATS msgStats; /* server's message counters */
- } SERVERSTATS;\fR
- .)b
- .(x z
- MESSAGESTATS
- .)x \*($n
- .(x z
- SERVERSTATS
- .)x \*($n
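- .lp
- The ownership rule for the statistics array can be illustrated as follows.
- This sketch does not call the real library: fake_ServerStatistics(\ )
- is a hypothetical stand-in that allocates an array the way the text
- describes, the MESSAGESTATS layout is a placeholder, and the
- zero-on-success return value is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Placeholder: the real MESSAGESTATS layout is not reproduced here. */
typedef struct { int placeholder; } MESSAGESTATS;

typedef struct {
    int numClients;
    int numTrans;
    int numVolumes;
    int freeLogSpace;
    int chpntFreq;
    int totalCommits;
    int totalAborts;
    int diskReads;
    int diskWrites;
    MESSAGESTATS msgStats;
} SERVERSTATS;

/* Hypothetical stand-in for sm_ServerStatistics(): allocates ONE array
 * describing two pretend servers. Returns 0 on success (assumed). */
int fake_ServerStatistics(int *numServers, SERVERSTATS **stats)
{
    SERVERSTATS *array = calloc(2, sizeof *array);
    if (array == NULL)
        return -1;
    array[0].totalCommits = 10;
    array[1].totalCommits = 32;
    *numServers = 2;
    *stats = array;
    return 0;
}

/* Sums the commit counters of all servers contacted, then releases
 * the array with a single call to free(). Returns -1 on failure. */
long total_commits(void)
{
    int numServers = 0, i;
    SERVERSTATS *stats = NULL;
    long total = 0;

    if (fake_ServerStatistics(&numServers, &stats) != 0)
        return -1;
    assert(stats != NULL);
    for (i = 0; i < numServers; i++)
        total += stats[i].totalCommits;
    free(stats);  /* the application owns the array */
    return total;
}
```

- The point of the sketch is the single free(\ ) of the whole array;
- the individual SERVERSTATS elements are not freed separately.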
- .lp
- The MESSAGESTATS structure contains statistics about
- the client-server protocol and the server-server protocol.
- A set of these statistics is kept by the client library, and
- a set is kept by each server.
- The client library's statistics are found in the global
- structure
- .(b L
- \fBextern MESSAGESTATS MsgStats;\fR
- .)b
- The MESSAGESTATS structure contains the following counters
- for each message type: messages sent, messages received,
- replies received with an error indication,
- replies received with no error,
- messages sent with no reply requested.
- The counters for replies have two different meanings, depending
- on which set of statistics is concerned.
- The servers count the replies \fIsent\fR
- with and without error
- indications, and the number of requests that the
- server \fIreceived\fR that did not require a reply at all.
- The client library counts the replies \fIreceived\fR
- with and without error indications, and the number of requests that the
- client \fIsent\fR that did not require a reply at all.
- .lp
- The following function
- prints the MESSAGESTATS structure:
- .sp
- .(b L
- \fBsm_PrintMessageStats (file, msgStats)
- FILE *const file; /* IN where to print */
- MESSAGESTATS *const msgStats; /* IN what to print */\fR
- .)b
- .(x z
- sm_PrintMessageStats(\ )
- .)x \*($n
- .lp
- The following function tells whether a mounted volume is
- a temporary volume, a data volume, or a log volume.
- .(x z
- temporary volume
- .)x \*($n
- See
- Section 5.1, \fBManaging Volumes\fR, for information
- about volumes.
- .sp
- .(b L
- \fBsm_VolumeProperties (volid, properties)
- VOLID volid; /* IN which volume is of interest */
- int *properties; /* OUT the properties */\fR
- .)b
- .(x z
- sm_VolumeProperties(\ )
- .)x \*($n
- .lp
- Sm_VolumeProperties(\ ) returns a set of bits
- that tell whether the given volume is a data
- volume or a temporary volume.
- The \*(lqvolid\*(rq argument is the volume identifier of
- the volume in question.
- If the volume is not mounted when sm_VolumeProperties(\ )
- is called, sm_VolumeProperties(\ ) mounts it.
- .lp
- VOLPROP_TEMP indicates that the volume is temporary
- .(x z
- temporary volume
- .)x \*($n
- (see
- Section 5.1.3, \fBTemporary Volumes\fR).
- If the bit VOLPROP_TEMP is not set in the result,
- the volume is a data volume.
- A log volume cannot be mounted by a client, and
- an attempt to get a log volume's properties results
- in an error.
- .sp
- .(b L
- \fBsm_AddServerVolume (flags, volid, option, value)
- FLAGS flags; /* IN which servers are of interest */
- VOLID volid; /* IN which volume is of interest */
- char *option; /* IN which format option to use */
- char *value; /* IN value for the format option */\fR
- .)b
- .(x z
- sm_AddServerVolume(\ )
- .)x \*($n
- .lp
- Sm_AddServerVolume(\ ) adds a volume to the list of mountable volumes
- on one or more servers
- (although it seldom makes sense to do this on more than one server
- with a single pair of arguments).
- The \*(lqflags\*(rq argument indicates which servers are of interest.
- The \*(lqvolid\*(rq argument is the volume identifier of
- the volume that will determine which server to contact when
- \*(lqflags\*(rq == VOL_BY_VOLID.
- The \*(lqoption\*(rq is one of
- the server's format options
- (\*(lqdataformat\*(rq or \*(lqtempformat\*(rq).
- The \*(lqvalue\*(rq
- argument is the value to be given the option named in \*(lqoption\*(rq.
- .lp
- Sm_AddServerVolume(\ ) adds the named volume to the server's list
- of known volumes, but the server does not try to mount the volume
- or verify that the volume exists or is valid.
- Sm_AddServerVolume(\ ) fails
- if the value given conflicts with
- another volume already in the server's table,
- either in the path name or the volume identifier.
- If your objective is to change the format information
- for a path name that is in the server's table,
- first remove the existing format information
- (using sm_RemoveServerVolume(\ ), described below),
- and subsequently add the new information.
- .sp
- .(b L
- \fBsm_RemoveServerVolume (flags, volid, volid2remove)
- FLAGS flags; /* IN which servers are of interest */
- VOLID volid; /* IN which volume id of server of interest */
- VOLID volid2remove; /* IN which volume to remove */\fR
- .)b
- .(x z
- sm_RemoveServerVolume(\ )
- .)x \*($n
- .lp
- Sm_RemoveServerVolume(\ ) removes
- \*(lqvolid2remove\*(rq from one or more servers' lists of mountable volumes.
- The volume cannot be removed from a server's table while the volume
- is in use;
- it must be dismounted before it is removed.
- .lp
- See also
- Section 5.1, \fBManaging Volumes\fR.
- .br
- .sh 3 "Tuning the Application"
- .lp
- The size of the application's buffer pool,
- determined by the \*(lqbufpages\*(rq option,
- is the primary tuning parameter that
- is under the control of applications.
- The \*(lqbufpages\*(rq option indicates the number of
- MIN_PAGESIZE
- pages in the buffer pool.
- It should be set large enough
- to hold the application's working set of objects.
- The buffer pool must not exceed the size of physical memory available to
- the client.
- .bp
- .sh 1 "USING STORAGE MANAGER SERVERS"
- .lp
- Storage Manager servers provide disk, file, transaction,
- concurrency control, and recovery services to clients.
- In most respects, users do not have to understand how servers
- work, but there are a few things that administrators should know;
- we focus on those things in this section.
- The first half of this section explains how to manage volumes.
- The second half explains how to operate a server.
- .sp
- .sh 2 "Managing Volumes"
- .lp
- Servers store data on \fIvolumes\fR,
- .(x z
- volume
- .)x \*($n
- .(x z
- files, Unix
- .)x \*($n
- .(x z
- partition
- .)x \*($n
- which can be Unix files or raw disk partitions.
- Each server is composed of a server process
- and one \fIdisk process\fR for each mounted volume.
- .(x z
- disk process
- .)x \*($n
- When a server requires I/O, it asks the appropriate disk
- process to read from or write to the server's buffer pool, which
- is located in a Unix System V shared-memory segment.
- The disk processes perform I/O so
- that the server never blocks when I/O is required.
- The server mounts a volume before using it, and the server
- dismounts the volume when it is no longer in use.
- Mounting a volume consists of forking a disk process for that volume.
- Dismounting the volume consists of flushing all dirty pages to the disk
- and killing the volume's disk process.
- .lp
- Volumes are created with the \fCformatvol\fR program, which
- establishes a volume's identifier, size, type, and other characteristics.
- Volumes come in three types: log volumes, data volumes, and temporary volumes.
- .(x z
- temporary volume
- .)x \*($n
- .sh 3 "Log Volumes"
- .lp
- Log volumes are used to store log information for
- aborting transactions and for recovery.
- The server has one log volume mounted at all times.
- .sh 3 "Data Volumes"
- .lp
- Data volumes are used to store objects and indexes
- that are meant to exist after a transaction ends.
- Changes to data volumes are logged so that transactions
- can be aborted or committed with reliability, and
- so that recovery can be performed after a crash.
- .sh 3 "Temporary Volumes"
- .lp
- Some applications store temporary private data
- and do not need concurrency control or recovery.
- The Storage Manager provides temporary volumes for this purpose.
- .(x z
- temporary volume
- .)x \*($n
- Locks are not acquired for data in temporary volumes,
- and updates to temporary volumes are not logged.
- Temporary volumes are less costly to use than data volumes are,
- but the data on them cannot be shared among transactions.
- The data on temporary volumes are deleted at the conclusion
- of the transaction that creates them, regardless of
- whether the transaction is committed or aborted.
- Temporary volumes cannot contain root entries.
- .lp
- The server can serve many data volumes and
- temporary volumes simultaneously.
- .sh 3 "Raw Partitions and Unix Files"
- .lp
- A volume can be a Unix file or a Unix raw partition.
- When a raw partition is used,
- data are transferred between the server's buffer pool and the disk
- by the disk process, bypassing the Unix file system's buffer pool.
- .lp
- When a Unix file is used, the data are written to the Unix
- file system's buffer pool, and the operating system worries about
- flushing the data to the disk.
- In this case,
- the server forces the data to the disk periodically with
- a Unix \fIfsync(\ )\fR system call.
- .br
- .sh 3 "Formatting Volumes"
- .lp
- Before a volume can be used, it must be formatted.
- This is done using the \fCformatvol\fR program, which can also
- display information about previously formatted volumes.
- Formatvol uses
- .(x z
- configuration options
- .)x \*($n
- the configuration options
- \*(lqdataformat\*(rq,
- \*(lqtempformat\*(rq, and
- \*(lqlogformat\*(rq
- to determine what characteristics to give volumes that it formats.
- The options have values that list the following
- information:
- .ip "path" 10
- The Unix path name of the volume, e.g., \fC/dev/rz2c\fR.
- .ip "volid" 10
- The volume identifier for this volume, an integer, e.g., 8000.
- .ip "#cyl" 10
- The number of cylinders on this disk, e.g., 1224 for a DEC RZ55.
- May be 1.
- .ip "#trk/cyl" 10
- The number of tracks per cylinder, e.g., 15 for a DEC RZ55.
- May be 1.
- .ip "#sect/trk" 10
- The number of sectors or blocks per track, e.g., 36 for a DEC RZ55.
- May be the number of \fIblocks\fR in the file.
- .(x z
- block in a file
- .)x \*($n
- A block is MIN_PAGESIZE bytes;
- MIN_PAGESIZE is defined in \fCsm_client.h\fR.
- (This is determined by the Storage Manager, not by the device.)
- \**
- .(f
- \**
- The format of a volume does not affect performance with
- most modern disks.
- The easiest way to format volumes is to
- use 1 cyl, 1 track/cyl, and let the sect/trk account
- for the size of the entire volume.
- .)f
- .ip "#KB/pg" 10
- \fBFor logformat only\fR.
- This gives the page size for log pages, in kilobytes.
- The value given here may be 4 or larger, and must be a power of 2.
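- .lp
- The relationship between the geometry fields and the size of a volume
- can be sketched in C.
- Two assumptions are made here and should be checked against your
- installation: that the number of blocks on a volume is the product of
- the three geometry fields, and that MIN_PAGESIZE is 4096 bytes (the
- real value is defined in \fCsm_client.h\fR).
- Both function names are hypothetical.

```c
#include <assert.h>

/* Assumed value for illustration; the real MIN_PAGESIZE is defined
 * in sm_client.h. */
#define ASSUMED_MIN_PAGESIZE 4096L

/* Size in bytes described by a (#cyl, #trk/cyl, #sect/trk) triple,
 * assuming the block count is the product of the three fields. */
long volume_bytes(long cyl, long trkPerCyl, long sectPerTrk)
{
    assert(cyl > 0 && trkPerCyl > 0 && sectPerTrk > 0);
    return cyl * trkPerCyl * sectPerTrk * ASSUMED_MIN_PAGESIZE;
}

/* The #sect/trk value to use with 1 cyl and 1 trk/cyl so that the
 * volume occupies at most `bytes` bytes (whole blocks only). */
long flat_sectors_for(long bytes)
{
    assert(bytes >= ASSUMED_MIN_PAGESIZE);
    return bytes / ASSUMED_MIN_PAGESIZE;
}
```

- Under these assumptions, a 10-Mbyte Unix file would be described with
- 1 cyl, 1 trk/cyl, and 2560 sect/trk.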
- .lp
- Formatvol collects the format information from the options
- in the configuration files,
- after which it determines which volumes to format or to display
- by processing the options
- \*(lqvolume\*(rq and \*(lqdisplay\*(rq from the command line.
- The options that formatvol understands are summarized in Table 2.
- .(b
- .TS
- box, center, tab(;);
- c|c|c
- c|c|c
- l|l|l.
- Option;Option;Option
- Name;Type;Description
- _
- tempformat;string,int,int,int,int;path,volid,#cyl,#trk/cyl,#sect/trk
- dataformat;string,int,int,int,int;path,volid,#cyl,#trk/cyl,#sect/trk
- logformat;string,int,int,int,int,int;path,volid,#cyl,#trk/cyl,#sect/trk,#KB/pg
- volume;int;volume to format - command line only
- display;int;volume to display - command line only
- .TE
- .ce 2
- \fBTable 2: Formatvol Options.\fR
- Fields are separated by white space, commas, colons or semicolons.
- .)b
- .(x z
- options, formatvol
- .)x \*($n
- .sp
- For example,
- to print information about the volumes with volids 8000 and 4000 use:
- .(b I
- \fCformatvol -dis 8000 -dis 4000\fR
- .)b
- .lp
- To format a data volume with volid 8000 and
- a temporary volume with volid 4000 use:
- .(x z
- temporary volume
- .)x \*($n
- .(b I
- \fCformatvol -vol 8000 -vol 4000\fR
- .)b
- .lp
- Formatting a volume writes a volume header and initializes
- the bitmaps that describe the free blocks on the volume.
- A volume that is reformatted after being used loses all its data.
- .lp
- The Storage Manager does not prevent a volume from being formatted
- while it is in use by a server, even though \fBit will cause
- the server to crash unrecoverably\fR.
- Be certain that a volume is not mounted before you format it! \**
- .(f
- \** The Storage Manager ought to lock volumes with Unix file locks,
- but Unix does not provide an adequate mechanism for locking and
- unlocking files in the context of crash recovery.
- .)f
- A volume is unmounted when all clients that are using the volume
- have completed transactions on it and have unmounted it.
- (A client may unmount a volume explicitly with sm_DismountVolume(\ ),
- or by shutting down with sm_ShutDown(\ ) or \fIexit(\ )\fR.)
- .lp
- During recovery,
- a server mounts the volumes that need recovery.
- The volumes are dismounted when recovery is completed.
- If a volume was in use at the time its server crashed,
- \fBdo not reformat the volume until a new server recovers the
- data on that volume\fR.
- If you do, the server's log will be inconsistent with the data
- on the volume, and the server will crash during recovery, and
- it will be unable to recover from that.
- You can reformat the data volumes and the log volume to get
- a server running again, but you will have lost all data on the volumes.
- .lp
- The log volume is mounted whenever the server is running,
- so a log volume can be formatted ONLY when the server is not running.
- .br
- .sh 3 "Size Requirements for Log Volumes"
- .lp
- How large should a log volume be?
- .(x z
- log volume, size of
- .)x \*($n
- .(x z
- log space
- .)x \*($n
- The answer depends on the expected transaction mix.
- More specifically, it depends on the age of the oldest
- (longest running) transaction
- in the system and the amount of
- log space used by all active transactions.
- Here are some general rules to determine
- the amount of free log space available in the system.
- .np
- The physical log is circular.
- Log space between the first log record generated
- by the oldest active transaction
- and
- the most recent log record generated by any transaction cannot
- be reused.
- .np
- Log space for a transaction is available for reuse when the
- transaction has committed or completely aborted.
- Aborting a transaction causes log space to be used, so
- space is \fIreserved\fR for aborting each transaction.
- Enough log space must be
- available to commit \fIor abort\fR all active transactions
- at all times.
- .np
- Only space starting at the \fIbeginning\fR of the log can be reused.
- This space can be reused if
- it contains log records only for transactions meeting rule 2.
- .np
- All sm_WriteObject(\ ) calls require log space twice the
- size of the space written in the object.
- All calls that create, grow, or shrink objects
- require log space equal to the size created, inserted, or deleted.
- Log records generated by these calls (generally one per call)
- have an overhead of approximately 50 bytes.
- .np
- File operations are logged, but the space requirements for
- them are most often negligible, since they are relatively rare
- operations, and are often performed in short transactions.
- .np
- The amount of log space \fIreserved\fR for aborting a
- transaction is equal to
- the amount of log space generated by the transaction (for
- the purpose of committing the transaction).
- .np
- When insufficient log space is available
- for a transaction, the transaction is aborted.
- .np
- The log should be at least 1 Mbyte (250 pages).
- .lp
- For example, consider a transaction T1, which creates 300
- objects of size 2,000 bytes, writes 20 bytes in 100 objects,
- and is committed.
- T1 requires at least 615 Kbytes of log space for the creates
- and 9 Kbytes of log space for the writes.
- Since log space must be reserved to abort the transaction,
- the log size must be over 1.248 Mbytes to run this transaction.
- Assuming T1 is the only transaction running in the system,
- all the log space it uses and reserves becomes available when it
- completes.
- If another transaction, T2, is started at the same time
- as T1, but is still running after T1 is committed, only the
- reserved space for T1 is available for other transactions. The
- portion of the log
- used by T1 and T2 is not available until T2 is finished.
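- .lp
- The arithmetic for T1 can be checked with a short C sketch.
- It is not part of the Storage Manager: the function names are
- hypothetical, the 50-byte record overhead is the approximate figure
- from rule 4, and one log record per call is assumed.

```c
#include <assert.h>

#define LOG_RECORD_OVERHEAD 50L  /* approximate bytes per log record */

/* Log bytes for creating `count` objects of `size` bytes each:
 * the size created plus the record overhead (rule 4). */
long create_log_bytes(long count, long size)
{
    assert(count >= 0 && size >= 0);
    return count * (size + LOG_RECORD_OVERHEAD);
}

/* Log bytes for `count` sm_WriteObject() calls of `size` bytes each:
 * twice the bytes written plus the record overhead (rule 4). */
long write_log_bytes(long count, long size)
{
    assert(count >= 0 && size >= 0);
    return count * (2 * size + LOG_RECORD_OVERHEAD);
}

/* Log bytes that must be available: the bytes used plus an equal
 * amount reserved for aborting the transaction (rule 6). */
long log_bytes_needed(long usedBytes)
{
    assert(usedBytes >= 0);
    return 2 * usedBytes;
}
```

- For T1, create_log_bytes(300, 2000) gives 615,000 bytes,
- write_log_bytes(100, 20) gives 9,000 bytes, and doubling their sum
- gives 1,248,000 bytes, which matches the 1.248 Mbytes quoted above.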
- .lp
- Transactions that fail because of insufficient log space are commonly
- those that load a large number of objects into
- a file during the
- creation of a database.
- A solution to this problem is to load the
- file in a series of smaller transactions.
- When the last transaction is committed, the load is complete.
- If the load needs to be aborted, a
- separate transaction is run to destroy the file.
- .br
- .sh 3 "Backing Up Volumes"
- .lp
- The Storage Manager does not support media recovery,
- .(x z
- volumes, backing up
- .)x \*($n
- so backing up critical data volumes is wise.
- A volume may be backed up when it is unmounted and needs no recovery.
- If a volume is stored on a Unix file, a simple copy
- of the file can be used as a backup.
- For volumes stored on a raw disk partition,
- the Unix \fIdd(1)\fR command can be used to back up the volume to a
- Unix file and to restore it.
- For example, to save a copy of the raw device
- \fC/dev/rrz4d\fR in the Unix file backup.rrz4d use:
- .(b
- \fCdd if=/dev/rrz4d of=backup.rrz4d\fR
- .)b
- To restore the backup, use:
- .(b
- \fCdd if=backup.rrz4d of=/dev/rrz4d\fR
- .)b
- .sp
- .sh 2 "Using the Server"
- .lp
- In this section we explain how to operate a Storage Manager server.
- For the purpose of this discussion, we use only one server,
- although any number of servers can be used to manage any number
- of volumes.
- We begin with starting and configuring the server.
- Next, we discuss what the server does during normal operation.
- We follow this with instructions for shutting the server down.
- Finally, we explain how the server recovers from failure.
- .sh 3 "Starting the Server"
- .lp
- The server is composed of
- two executable files:
- \fCsm_server\fR and \fCdiskrw\fR.
- .(x z
- disk process
- .)x \*($n
- \fCSm_server\fR is the main server program.
- \fCDiskrw\fR is started by the server,
- as a separate process for each mounted volume,
- for performing asynchronous disk I/O.
- These processes communicate with the server through sockets, semaphores,
- and shared memory.
- By default, the server assumes \fCdiskrw\fR is located in the user's path.
- .(x z
- default path for diskrw
- .)x \*($n
- An option, described below, can be used to change this assumption.
- .lp
- When the server is started, it first processes configuration options.
- .(x z
- configuration options
- .)x \*($n
- These options are discussed further below.
- Second, the server allocates the buffer pool.
- The buffer pool is located in shared memory, so the operating
- system must have shared-memory support.
- Furthermore, the machine on which the server runs
- must have enough shared memory to accommodate the
- entire buffer pool.
- If not enough shared memory is available, the
- server prints a message, indicating how much shared memory
- it is trying to acquire, and exits.
- .lp
- Third, the server mounts the log volume.
- .(x z
- log volume
- .)x \*($n
- .(x z
- regenerating log volume
- .)x \*($n
- .(x z
- log volume, regenerated
- .)x \*($n
- If the log volume is newly formatted, it is \fIregenerated\fR.
- When a log volume is regenerated,
- the entire log is cleared and written to disk.
- This will take noticeable time if the volume is large.
- If the log is not regenerated, recovery analysis is performed.
- .lp
- If no volumes require recovery,
- .(x z
- recovery
- .)x \*($n
- all phases of recovery complete in less than one second.
- If the analysis determines that any volumes require recovery
- (due to a previous failure of some sort:
- operating system failure, machine failure,
- internal error, or because a user killed the server),
- recovery is performed.
- Data volumes that were mounted at the time of the failure
- are remounted,
- updates by committed transactions are restored,
- and all transactions in progress at the time of failure are aborted.
- When recovery is complete,
- the data volumes are dismounted
- and a checkpoint is taken.
- .lp
- The server now begins to process requests from clients.
- .br
- .sh 3 "Configuring the Server"
- .lp
- There are several \fIconfiguration options\fR that
- .(x z
- configuration, options for server
- .)x \*($n
- can be set when the server is started.
- A brief description of the options is given in Table 3.
- Most options have default values, but some do not, and these
- \fImust\fR be given values, either on the command line
- or in a configuration file.
- See
- Section 3 for general information that applies to all options.
- .(z
- .(x z
- default, option values
- .)x \*($n
- .sz -2
- .TS
- box, center, tab(#);
- c|c|c|c|c
- c|c|c|c|c
- l|l|l|l|l.
- Option#Option#Possible#Default#Option
- Name#Type#Values#Values#Description
- _
- config#string#file name#/usr/lib/sm_config#read a configuration file
- ###$HOME/.sm_config#defaults are read unless
- ###./.sm_config#skipdefault is set
- verbose#Boolean#yes no#no#print configuration options
- bufpages#int#> 32#none#number of buffer pool pages
- logvolume#string#path name#none#name of the log volume
- portname#string#name or number#exodussm#port name or port number
- ####for a server; if a name, it
- ####must be in \fC/etc/services\fR
- errorfile#string#file name#- (stderr)#file for errors,
- ####warnings, progress
- regenlog#Boolean#yes no#no#clear the log
- shutdown#Boolean#yes no#no#shut down after recovery
- ####or regeneration of log
- checkpoints#int#> 1#100#checkpoint frequency
- ####(based on number of log pages)
- diskproc#string#file name#/usr/lib/exodus/diskrw#disk I/O program name
- intercache#Boolean#yes no#yes#allow caching of pages
- ####at the client between
- ####transactions
- progress#Boolean#yes no#no#control progress printing
- maxclients#int#> 0#20#maximum number of
- ####clients to be served
- ####simultaneously
- maxthreads#int#> 1#function(maxclients)#maximum number of
- ####threads.
- traceflags#int#hex number#0x0#set tracing flags.
- ####Available if server is
- ####compiled with -DDEBUG.
- tempformat#string###see Table 2.
- dataformat#string###see Table 2.
- logformat#string###see Table 2.
- maxaddvolumes#int#small number >= 0#0#increases volume table size
- wrapcount#int#>=0#0#starting wrap count for log
- .TE
- .sz +2
- .ce
- .uh "Table 3: Server Options"
- .(x z
- options, server
- .)x \*($n
- .)z
- .lp
- Option values are read from the default configuration files
- \fC/usr/lib/sm_config\fR, \fC$HOME/.sm_config\fR, and \fC./.sm_config\fR
- in that order, if they exist.
- If the command-line option \*(lqskipdefault\*(rq is given,
- .(x z
- configuration files, skipping defaults
- .)x \*($n
- .(x z
- default configuration files, skipping
- .)x \*($n
- these default files are not read.
- .lp
- Options on the command line are read after the default
- files are read.
- Command-line options are prefixed by a \*(lq-\*(rq.
- In addition to options,
- a server accepts the command-line \fIflags\fR given in Table 4.
- Command-line flags are prefixed by a \*(lq-\*(rq.
- .(z
- .sz -2
- .TS
- box, center, tab(#);
- c|c
- c|c
- l|l.
- Flag#Flag
- Name#Effect
- _
- help#print a message and exit
- skipdefault#do not read default configuration files
- #must be the first argument on the command line
- force#do not confirm log regeneration option
- background#put in background (for use with Bourne shell)
- .TE
- .sz +2
- .ce
- .uh "Table 4: Server Command-Line Flags"
- .(x z
- flags, server command-line
- .)x \*($n
- .)z
- .lp
- When given the
- \*(lqhelp\*(rq flag,
- a server prints a list of the available options and flags,
- and exits.
- .lp
- The \*(lqskipdefault\*(rq flag prevents a server from
- reading the default configuration files.
- It must be the first argument on the command line if it is used.
- .lp
- The \*(lqforce\*(rq flag prevents a server from checking with
- the user before regenerating the log.
- .lp
- The \*(lqbackground\*(rq flag
- causes the server to disconnect from its controlling terminal.
- This flag is available for users who run the server from
- shells that, like the Bourne shell, do not have real job control.
- .lp
- We now describe each option from Table 3.
- .lp
- The \*(lqconfig\*(rq option specifies a configuration file to read after
- default configuration files have been read.
- .(x z
- configuration file, which to read
- .)x \*($n
- This option is effective only on the command line.
- .lp
- The \*(lqverbose\*(rq option is used to turn on and off printing of the
- option values at startup.
- Options are printed to the file specified by the \*(lqerrorfile\*(rq option (q.v.).
- .lp
- The \*(lqbufpages\*(rq option indicates the number of MIN_PAGESIZE
- pages to be used for a server's buffer pool.
- The option must be given for a server to run. This option
- determines the size of the shared memory segment allocated by the
- server. The shared memory segment will be MIN_PAGESIZE*bufpages bytes
- long plus a few KB extra.
- See Section 5.3, \fBTuning the Server\fR,
- for more information about setting this option.
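- .lp
- As a rough guide, the size of the shared-memory segment can be
- estimated from \*(lqbufpages\*(rq.
- The sketch below assumes a MIN_PAGESIZE of 4096 bytes, which should be
- checked against \fCsm_client.h\fR; the function name is hypothetical,
- and the result is a lower bound because the few extra kilobytes are
- not specified.

```c
#include <assert.h>

/* Assumed value for illustration; see sm_client.h for the real one. */
#define ASSUMED_MIN_PAGESIZE 4096L

/* Lower bound, in bytes, on the shared-memory segment needed for a
 * buffer pool of `bufpages` pages. */
long min_shared_memory_bytes(long bufpages)
{
    assert(bufpages > 32);  /* Table 3 lists "> 32" as the valid range */
    return bufpages * ASSUMED_MIN_PAGESIZE;
}
```

- Under this assumption, a pool of 1000 pages needs a segment of
- at least 4,096,000 bytes.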
- .lp
- The \*(lqlogvolume\*(rq option gives the path name of the volume
- that contains the log.
- A value must be given for the log volume.
- .lp
- The \*(lqportname\*(rq option indicates a port number or
- the symbolic name of a port entry in \fC/etc/services\fR.
- The server connects to this port and listens for client requests on it.
- To enable clients to locate a server with a symbolic
- port name, the port name must be present in \fC/etc/services\fR
- on both the client and server machines.
- If no port name is given,
- a server looks for an entry \*(lqexodussm\*(rq, registered for use with TCP,
- in \fC/etc/services\fR.
- .lp
- Using port numbers instead of symbolic names avoids the
- need for entries in \fC/etc/services\fR.
- See the Unix manual page for services(5).
- An example entry for the default
- server name is:
- .(b
- \fCexodussm 1152/tcp # exodus storage manager\fR
- .)b
- .lp
- The \*(lqerrorfile\*(rq option directs server error messages and
- diagnostics to the given file.
- A value of \*(lq-\*(rq means that \fIstderr\fR is used.
- .lp
- The \*(lqregenlog\*(rq option causes the log on the log volume to be
- regenerated.
- \fBThis overwrites all log records, so it
- should not be done unless the server was last shut down cleanly\fR.
- Servers automatically regenerate their logs when
- they are started with newly formatted log volumes.
- When the option is set to \*(lqyes\*(rq, a confirmation is requested.
- The confirmation can be disabled by starting the server with
- the \*(lqforce\*(rq flag.
- .lp
- The \*(lqshutdown\*(rq option causes a server to shut down immediately
- after performing recovery or regenerating the log.
- .lp
- The \*(lqcheckpoints\*(rq option sets the checkpoint frequency for
- a server.
- The value represents the number of log pages written
- between checkpoints.
- .lp
- The \*(lqprogress\*(rq option causes a server to print messages
- tracing its progress.
- This is used for debugging; it slows the server.
- .lp
- The \*(lqdiskproc\*(rq option specifies the path name of the disk I/O
- program to be used by the server.
- .lp
- The \*(lqintercache\*(rq option allows experiments to be run
- with and without inter-transaction caching of
- pages on the client.
- .lp
- The \*(lqmaxclients\*(rq option determines the number of clients
- a server can serve at any one time.
- Servers create internal tables whose size depends on this value.
- .lp
- The \*(lqmaxthreads\*(rq value, determined by the \*(lqmaxclients\*(rq
- value, should be sufficient, but can be overridden.
- If a server recovers from a failure without running out
- of threads, it has enough threads to handle client requests.
- If numerous distributed transactions are active at the time
- .(x z
- transactions, distributed
- .)x \*($n
- .(x z
- distributed transactions
- .)x \*($n
- of a server failure, it is possible, but unlikely,
- that the server will
- not be able to recover with the default number of threads.
- .lp
- The \*(lqtraceflags\*(rq option is available only with a server
- that was compiled with debugging (the -DDEBUG flag).
- It is useful for programmers who are modifying the Storage
- Manager source code and testing their changes.
- .lp
- The \*(lqdataformat\*(rq, \*(lqlogformat\*(rq, and \*(lqtempformat\*(rq
- options are as described in
- Section 5.1.5, \fBFormatting Volumes\fR.
- Servers can mount and use volumes given in these options.
- .lp
- The \*(lqmaxaddvolumes\*(rq option
- indicates how large the mount table will be.
- The server reads its configuration files, counts the volumes
- named in the format options, and creates a mount table
- large enough to mount this many volumes and \*(lqmaxaddvolumes\*(rq more.
- This is a strict limit to the number of volumes that the server can
- mount (at any one time) as long as it is running.
- The value of \*(lqmaxaddvolumes\*(rq should not be boosted
- frivolously, because the size of the mount table affects
- the amount of shared memory required by the server.
- The default value is 0.
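- .lp
- The sizing rule for the mount table can be written out as a short
- Python sketch.
- The function names are hypothetical, and the option values other than
- the \*(lqdataformat\*(rq example used elsewhere in this manual are placeholders;
- only the arithmetic is taken from the description above.

```python
def mount_table_size(format_options, maxaddvolumes=0):
    """Slots = volumes named in the format options + maxaddvolumes."""
    if maxaddvolumes < 0:
        raise ValueError("maxaddvolumes must be non-negative")
    return len(format_options) + maxaddvolumes


def free_mount_slots(table_size, volumes_in_use):
    """Number of additional volumes that could still be mounted."""
    return table_size - volumes_in_use


# One (option name, option value) pair per configured volume.
# The log and temp values are placeholders, not real format strings.
configured = [
    ("dataformat", "/path/to/datafile:8000:1:1:300"),
    ("logformat", "<log volume format value>"),
    ("tempformat", "<temp volume format value>"),
]

size = mount_table_size(configured, maxaddvolumes=2)
print(size)                       # 5
print(free_mount_slots(size, 3))  # 2
```

- .lp
- With the default \*(lqmaxaddvolumes\*(rq value of 0, the table in this
- model has room only for the volumes named in the configuration.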
- .lp
- The \*(lqwrapcount\*(rq option is rarely needed.
- The server will tell you if you ever need to set this option.
- It is needed if you add volumes after
- the server starts (maxaddvolumes > 0),
- and a volume that you are adding was updated by a server
- running on a log that differs from the current log
- (or the log was regenerated since the added volume was last mounted).
- .sp
- .sh 3 "Normal Operation of Servers"
- .lp
- During normal operation, servers listen for connections and
- requests from clients and monitor terminal input.
- Error messages are printed on the servers' terminals when
- interesting events occur, for example, when a deadlock is
- detected, or a transaction is aborted by a server because of
- a problem such as insufficient log space.
- .sh 4 "Server Commands"
- .lp
- The following commands can be invoked from the standard input to the
- server:
- \*(lqhelp\*(rq,
- \*(lqshutdown\*(rq, \*(lqkill\*(rq, \*(lqcrash\*(rq,
- \*(lqcheckpoint\*(rq,
- \*(lqprintstats\*(rq, \*(lqclearstats\*(rq,
- \*(lqprogress\*(rq,
- \*(lquser\*(rq,
- \*(lqaddvolume\*(rq,
- \*(lqrmvolume\*(rq,
- \*(lqlistvolumes\*(rq,
- \*(lqlistmount\*(rq,
- \*(lqlistdistr\*(rq,
- \*(lqsource\*(rq,
- \*(lqredirect\*(rq.
- When the server is compiled with
- profiling (-DPROFIL, -p), the server accepts the \*(lqprofil\*(rq command.
- When the server is compiled with debugging
- (-DDEBUG), the server also accepts the
- \*(lqtraceflags\*(rq and \*(lqtracelevel\*(rq commands.
- .\" TODO: add tracelevel as regular option
- .lp
- The \*(lqhelp\*(rq command provides a list of the
- commands.
- .lp
- The \*(lqshutdown\*(rq command instructs the server to abort all
- active transactions and cleanly shut down.
- The \*(lqkill\*(rq command causes the server to halt immediately after
- displaying the status of mounted volumes.
- The \*(lqcrash\*(rq command has the same effect as the \*(lqkill\*(rq command,
- except that
- a core dump is produced as well.
- .lp
- The \*(lqcheckpoint\*(rq command causes the server to take a checkpoint
- immediately.
- Checkpoints are taken periodically by servers.
- The default frequency is once every 100 log pages, but this
- .(x z
- checkpoint
- .)x \*($n
- .(x z
- checkpoint frequency, default
- .)x \*($n
- .(x z
- default checkpoint frequency
- .)x \*($n
- can be changed by an application program
- (see sm_ChangeCheckpointFrequency(\ ) in
- Section 4.11.2, \fBAdministrative Operations\fR).
- .lp
- The \*(lqprintstats\*(rq command prints general server statistics.
- The \*(lqclearstats\*(rq command clears any counters among the statistics.
- .lp
- The \*(lqprogress\*(rq command reverses the value of the
- \*(lqprogress\*(rq option.
- .lp
- The \*(lquser\*(rq command reverses the value of an internal
- flag that determines whether or not the server prints a message
- when a user (application) error is encountered.
- (There is no option to control this.)
- .\" TODO: add a regular 'user' option
- .lp
- The \*(lqaddvolume\*(rq command
- adds a volume to the server's table of mountable volumes.
- The \*(lqaddvolume\*(rq command
- takes a format-option name
- and a format-option value.
- For example, to add the data volume 8000, type
- .(b
- \fCaddvolume dataformat /path/to/datafile:8000:1:1:300\fR
- .)b
- A volume cannot be added if the given format information
- conflicts with other information in the table.
- .lp
- The \*(lqrmvolume\*(rq command
- removes a volume from
- the server's table of mountable volumes.
- The command takes a volume identifier.
- For example, to remove the data volume 8000, type
- .(b
- \fCrmvolume 8000\fR
- .)b
- A volume cannot be removed if it is in use.
- .lp
- The \*(lqlistvolumes\*(rq command
- prints the server's table of mountable volumes.
- .lp
- The \*(lqlistmount\*(rq command prints a list of the
- volumes that are in some state of use: mounted,
- being mounted or being dismounted.
- It also prints the number of free \*(lqmount slots\*(rq,
- which indicates how many more volumes could be mounted
- at any one time, given the server's configuration.
- To allow more volumes to be mounted at once,
- shut the server down, boost the value of
- the \*(lqmaxaddvolumes\*(rq option, and restart the server.
- .lp
- The \*(lqlistdistr\*(rq command prints information about
- prepared distributed transactions.
- .(x z
- transactions, distributed
- .)x \*($n
- .(x z
- distributed transactions
- .)x \*($n
- These transactions consume space in the log, and
- if they are not aborted or committed, eventually the
- server will fail because it will have run out of log space.
- .(x z
- log space
- .)x \*($n
- See
- Section 4.3, \fBTransactions\fR, and
- Section 4.11.1, \fBExternal Two-Phase Commit Functions\fR,
- for information about distributed transactions.
- .lp
- The \*(lqsource\*(rq command takes one argument, the path name
- of a file from which to read commands.
- The server processes these commands, and when it reads the
- last command in the file, it resumes reading from
- the terminal.
- If the path name is missing or is
- \fC/dev/tty\fR, reading resumes from the terminal.
- .lp
- The \*(lqredirect\*(rq command takes two arguments.
- The first argument indicates which output stream is to
- be redirected: messages to the terminal or
- error messages.
- The second argument is the path name of a file
- to which the output is written.
- When the output is redirected again, the stream is flushed
- to the given file and the file is closed.
- To redirect output to the terminal, use \fC/dev/tty\fR
- or omit the path name.
- .lp
- The \*(lqprofil\*(rq command causes the server to dump
- its profiling information to disk.
- This command is available only on a server that was compiled
- with profiling on (-DPROFIL -p).
- See the manual page for prof(1).
- .lp
- The \*(lqtraceflags\*(rq command may take an integer argument,
- which may be a hexadecimal number, such as \*(lq0xfa3\*(rq,
- in which case it sets the server's trace flags word to that value.
- The command is available only with a server that was compiled
- with debugging on (-DDEBUG -g).
- The meanings of the trace flags are found in the server's
- source code, in \fCsrc/include/global_trace.h\fR.
- When \*(lqtraceflags\*(rq is used with no argument, it prints the
- value of the trace flags word.
- .lp
- The \*(lqtracelevel\*(rq command is available with a server
- that was compiled with debugging on (-DDEBUG -g).
- When used with no argument, it prints
- the trace level for the trace flags that are on.
- When given an integer argument (1, 2, or 3),
- it sets the trace level for the trace flags that are on.
- .br
- .sh 3 "Shutting Down the Server"
- .lp
- The server can be shut down in several ways.
- One method is to use one of the above-mentioned commands.
- Another is to run the \*(lqshutserver\*(rq program, described below,
- at the end of this section.
- A third way to shut down a server is to
- call sm_ShutdownServer(\ ) in a client program.
- .lp
- A server may also shut itself down because of a fatal
- error, such as the unexpected death of a disk process or a bug.
- A fatal error causes the server to report the state of all
- the mounted volumes, dump core, and exit.
- .lp
- The server allocates a Unix System V shared-memory segment and a
- semaphore set when it starts.
- If a server is shut down in a controlled fashion,
- it removes the segment and semaphore set.
- These resources are not
- removed when the server is terminated
- by \fCkill -9 <server process>\fR typed in the shell,
- by the \*(lqkill\*(rq or \*(lqcrash\*(rq command given to the server's terminal monitor,
- or when the server process is killed by a debugger.
- \fBIf you use any one of these means to terminate a
- server, you must use ipcrm(1) to remove the
- resources.\fR
- See the manual pages for ipcs(1) and ipcrm(1) for more
- information.
- If the segments and semaphore sets are not removed,
- eventually the operating system will run out of segments,
- and you will be unable to start a new server.
- .lp
- If a server shuts down without
- having committed or aborted all its active transactions
- and flushed all its dirty pages to disk,
- recovery is required when the server is restarted.
- When a server shuts down, it prints the status of all the
- mounted volumes.
- It indicates if recovery is necessary on those volumes.
- .br
- .sh 4 "Running the Shutserver program"
- .lp
- The \fCshutserver\fR program is invoked as follows:
- .(b
- \fCshutserver [-m machine] [-s servername] [-h]\fR
- .)b
- The \*(lqmachine\*(rq argument specifies the name of the machine on which the
- server to be shut down is running.
- If \*(lq-m machine\*(rq is not given, the program
- uses the machine on which \fCshutserver\fR is executed.
- The \*(lqservername\*(rq is the name of the server in \fC/etc/services\fR.
- .\" TODO: this should be -port name.
- If \*(lq-s servername\*(rq is not given, \*(lqexodussm\*(rq is used.
- The \*(lq-h\*(rq option prints a brief help message.
- .br
- .sh 3 "Recovery"
- .lp
- When a server is started after a failure, it automatically performs recovery.
- The time it takes for recovery depends on several factors, including
- the number of transactions in progress at the time of the failure,
- the number of log records generated by these transactions,
- and the number of log records generated since the last checkpoint.
- .lp
- Recovery has three phases.
- .(x z
- recovery
- .)x \*($n
- After each phase, the server prints
- information about the time and I/O operations required to perform the
- phase.
- .lp
- The first phase is \fIanalysis\fR.
- The log is scanned to determine what transactions were active
- and which volumes were mounted at the time of the failure.
- .lp
- After analysis, the volumes are mounted
- and the \fIredo\fR phase is performed.
- In the redo phase,
- data are restored to their state at the time
- of the failure.
- .lp
- In the last phase, the \fIundo\fR phase,
- the server aborts
- the transactions that were active at the time of the crash.
- The volumes are dismounted, and a checkpoint is taken.
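- .lp
- The three phases can be illustrated with a toy model.
- The sketch below is Python, not Storage Manager code:
- the log record layout and the function name \fCrecover\fR are
- assumptions made for the example.
- The real algorithm is considerably more involved;
- checkpoints and volume mounting, for instance, are not modeled here.

```python
def recover(log, disk):
    """Replay a toy log against a copy of the disk state.

    Log records (an assumed layout for this example):
      ("begin", tid), ("commit", tid),
      ("update", tid, key, old_value, new_value)
    Returns (recovered data, ids of the transactions that were aborted).
    """
    # Analysis: find transactions that began but never committed.
    active = set()
    for record in log:
        kind, tid = record[0], record[1]
        if kind == "begin":
            active.add(tid)
        elif kind == "commit":
            active.discard(tid)

    # Redo: reapply every update in log order, which restores the
    # data to their state at the time of the failure.
    data = dict(disk)
    for record in log:
        if record[0] == "update":
            _, tid, key, old, new = record
            data[key] = new

    # Undo: scan backwards, rolling back updates made by the
    # transactions that were still active at the failure.
    for record in reversed(log):
        if record[0] == "update" and record[1] in active:
            _, tid, key, old, new = record
            data[key] = old

    return data, sorted(active)


LOG = [
    ("begin", 1),
    ("update", 1, "a", 1, 10),
    ("begin", 2),
    ("update", 2, "b", 2, 20),
    ("commit", 1),
    ("update", 2, "a", 10, 30),
]

print(recover(LOG, {"a": 1, "b": 2}))  # ({'a': 10, 'b': 2}, [2])
```

- .lp
- In the example, transaction 1 committed, so its update to \*(lqa\*(rq survives;
- transaction 2 was active at the failure, so both of its updates are
- rolled back.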
- .lp
- For details of recovery in the Storage Manager, see
- [Fran92].
- .br
- .sh 2 "Tuning the Server"
- .lp
- There are several tuning parameters in the Storage Manager server.
- The following sections describe each one.
- .br
- .sh 4 "The Size of the Buffer Pool"
- .lp
- The size of a server's buffer pool
- is determined by the \*(lqbufpages\*(rq
- option, which
- indicates the number of MIN_PAGESIZE pages
- in the buffer pool.
- If a server is the primary process on a machine, it
- should have a buffer pool close to
- the size of available shared memory.
- When both an application and a server are running on the
- same machine, choosing a buffer pool size is more difficult.
- A
- \*(lqproper\*(rq
- choice depends on the behavior of the applications
- and their interactions with servers.
- A good rule of thumb is that
- clients should have adequate
- buffer space, to minimize client-server interaction.
- .lp
- The buffer pool must fit in the available shared memory
- of the machine on which the server runs.
- The server will let you know if it cannot acquire
- enough shared memory when it starts.
- See the manual pages for
- ipcs(1) and ipcrm(1) to find out how
- much shared memory is in use.
- If you find that you cannot
- run a server with a buffer pool of adequate size,
- and no shared memory segments are being wasted,
- see your system administrator to find out
- how much shared memory has been configured for
- your system.
- .br
- .sh 4 "The Size of Log Pages"
- .lp
- The log page size is determined when a log volume is formatted.
- For a transaction mix dominated by
- transactions that generate more than a few
- kilobytes of log information, the larger the log page size, the better.
- For short running transactions,
- such as those found in transaction processing benchmarks,
- 8 Kbyte log pages give good results.
- .br
- .sh 4 "Checkpoint Frequency"
- .lp
- The checkpoint frequency is based on the
- number of log pages written.
- The default frequency is every 100 log pages.
- The frequency can be set with the
- \*(lqcheckpoints\*(rq configuration option.
- .(x z
- checkpoint frequency
- .)x \*($n
- .(x z
- default checkpoint frequency
- .)x \*($n
- It can be changed in a running server by an
- application that calls sm_ChangeCheckpointFrequency(\ ).
- More frequent checkpoints tend to shorten the time
- required to recover after a server fails
- at the expense of processing time during normal operation.
- Checkpoints also cause the server's dirty pages to be flushed
- to disk, which may also improve performance during normal
- operation.
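- .lp
- The page-count trigger can be sketched in a few lines of Python.
- The class \fCCheckpointScheduler\fR and its methods are hypothetical;
- \fCset_frequency\fR merely stands in for the effect of
- sm_ChangeCheckpointFrequency(\ ), and resetting the count after an
- explicit checkpoint is an assumption of the sketch.

```python
class CheckpointScheduler:
    """Take a checkpoint after every `frequency` log pages written."""

    def __init__(self, frequency=100):  # 100 is the documented default
        self.set_frequency(frequency)
        self.pages_since_checkpoint = 0
        self.checkpoints_taken = 0

    def set_frequency(self, frequency):
        if frequency < 1:
            raise ValueError("frequency must be at least 1 log page")
        self.frequency = frequency

    def log_page_written(self):
        self.pages_since_checkpoint += 1
        if self.pages_since_checkpoint >= self.frequency:
            self.take_checkpoint()

    def take_checkpoint(self):
        # Also models an explicit "checkpoint" command.
        self.checkpoints_taken += 1
        self.pages_since_checkpoint = 0


scheduler = CheckpointScheduler()
for _ in range(250):
    scheduler.log_page_written()
print(scheduler.checkpoints_taken)       # 2
print(scheduler.pages_since_checkpoint)  # 50
```

- .lp
- A smaller frequency value leaves fewer log pages to process after a
- failure, but it makes the server spend more time taking checkpoints.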
- .bp
- .sh 1 "REFERENCES"
- .sp
- .ip "[Care86]" 10
- M. Carey, D. DeWitt, J. Richardson, and E. Shekita,
- \fIObject and File Management in the EXODUS Extensible Database System\fR,
- \fBProc. of the 1986 VLDB Conf.\fR,
- Kyoto, Japan, Aug. 1986.
- .ip "[Care89]" 10
- M. Carey, D. DeWitt, E. Shekita,
- \fIStorage Management for Objects in EXODUS\fR,
- \fBObject-Oriented Concepts, Databases, and Applications\fR,
- W. Kim and F. Lochovsky, eds., Addison-Wesley, 1989.
- .ip "[Chou85]" 10
- H. Chou and D. DeWitt,
- \fIAn Evaluation of Buffer Management Strategies for Relational Database Systems\fR,
- \fBProc. of the 1985 VLDB Conf.\fR,
- Stockholm, Sweden, Aug. 1985.
- .ip "[Fran92]" 10
- M. Franklin, M. Zwilling, C. K. Tan, M. Carey, and D. DeWitt,
- \fICrash Recovery in Client-Server EXODUS\fR,
- \fBProc. of the ACM SIGMOD Int'l. Conf. on Management of Data\fR,
- San Diego, CA, June 1992.
- .ip "[Gray78]" 10
- J. N. Gray,
- \fINotes on Database Operating Systems\fR,
- \fBLecture Notes in Computer Science 60,
- Advanced course on Operating Systems\fR,
- ed. G. Seegmuller, Springer Verlag, New York 1978.
- .ip "[Gray88]" 10
- J. Gray, R. Lorie, G. Putzolu, I. Traiger,
- \fIGranularity of Locks and Degrees of Consistency in a Shared Data Base\fR,
- \fBReadings in Database Systems\fR,
- ed. M. Stonebraker, Morgan Kaufmann, San Mateo, Ca., 1988.
- .ip "[Litw88]" 10
- W. Litwin,
- \fILinear Hashing: A New Tool for File and Table Addressing\fR,
- \fBReadings in Database Systems\fR,
- ed. M. Stonebraker, Morgan Kaufmann, San Mateo, Ca., 1988.
- .ip "[Moha83]" 10
- C. Mohan, B. Lindsay,
- \fIEfficient Commit Protocols for the Tree of Processes
- Model of Distributed Transactions\fR,
- \fBProc. 2nd ACM SIGACT/SIGOPS Symposium on Principles of Distributed
- Computing\fR,
- Montreal, Canada, August, 1983.
- .ip "[Moha89]" 10
- C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwarz,
- \fIARIES: A Transaction Recovery Method Supporting
- Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead
- Logging\fR,
- \fBACM Transactions on Database Systems\fR,
- Vol. 17, No 1, March 1992.
- .ip "[Rich87]" 10
- J. Richardson and M. Carey,
- \fIProgramming Constructs for Database System Implementation in EXODUS\fR,
- \fBProc. of the ACM SIGMOD Int'l. Conf. on Management of Data\fR,
- San Francisco, CA, May 1987.
- .ip "[exoArch]" 10
- \fIEXODUS Storage Manager Architecture Overview\fR, unpublished,
- included in EXODUS Storage Manager software release.
- .bp
- .\" use alphabetic section header A.1, A.2, etc.
- .af $1 A
- .nr $1 0
- .af $9 A
- .nr $9 1
- .sh 1 "APPENDIX : Locking Protocol for Storage Manager Operations"
- .lp
- The Storage Manager performs concurrency control using the standard
- hierarchical two-phase locking protocol (see [Gray78], [Gray88])
- .(x z
- locking protocol
- .)x \*($n Appendix
- .(x z
- two-phase locking protocol
- .)x \*($n Appendix
- for locking files and object pages.
- The lock hierarchy contains two granularities: file-level and page-level.
- Locking for index operations is performed with a non-two-phase protocol
- that allows multiple clients to read and update the same index.
- This section describes the lock modes used in the system,
- lists the locks requested for each Storage Manager
- file and object operation,
- and explains how deadlocks are handled.
- .(x z
- deadlock
- .)x \*($n Appendix
- Lock acquisition and release are \fIimplicit\fR in
- all relevant operations, so clients cannot explicitly manage their own locks.
- .br
- .sh 2 "Lock Modes"
- .lp
- Files are locked in one of six modes: no lock (NL), shared (S),
- exclusive (X), intent to share (IS), intent to exclusive (IX),
- .(x z
- lock, exclusive
- .)x \*($n Appendix
- and share with intent to exclusive (SIX) [Gray78], [Gray88]. Only shared and
- exclusive locks are obtained on pages. Determining whether two locks
- are compatible (e.g., when a client holds a lock on a file and another
- client wants to obtain a lock on it as well) can be done using
- a table.
- Table \n($9.1 is a lock compatibility table for the six file
- lock modes. Each row indicates a lock that some client can hold,
- and each column indicates a lock desired by another client. The Y and
- N table entries indicate (yes or no) whether the locks are compatible
- or not.
- .\" ) to match open paren in Table reference above
- .(z
- .TS
- center, tab(#), box ;
- c s s s s s s
- c|c s s s s s
- c|c c c c c c
- l|l l l l l l.
-
- Lock#Lock Requested
- Held#NL#IS#IX#S#SIX#X
- _
- NL#Y#Y#Y#Y#Y#Y
-
- IS#Y#Y#Y#Y#Y#N
-
- IX#Y#Y#Y#N#N#N
-
- S#Y#Y#N#Y#N#N
-
- SIX#Y#Y#N#N#N#N
-
- X#Y#N#N#N#N#N
- .TE
- .ce
- .uh "Table \n($9.1: Lock Compatibility"
- .\" ) to match open paren in .uh above
- .)z
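- .lp
- The compatibility table lends itself to a direct lookup.
- The following Python sketch encodes the rows exactly as printed above;
- the names \fCMODES\fR and \fCcompatible\fR are chosen for the example and are
- not part of the Storage Manager interface.

```python
MODES = ("NL", "IS", "IX", "S", "SIX", "X")

# One row per lock held; one character per lock requested, in MODES order.
_COMPATIBLE = {
    "NL":  "YYYYYY",
    "IS":  "YYYYYN",
    "IX":  "YYYNNN",
    "S":   "YYNYNN",
    "SIX": "YYNNNN",
    "X":   "YNNNNN",
}


def compatible(held, requested):
    """True if `requested` can be granted while another client holds `held`."""
    return _COMPATIBLE[held][MODES.index(requested)] == "Y"


print(compatible("IS", "SIX"))  # True
print(compatible("S", "IX"))    # False
```

- .lp
- The relation is symmetric: swapping the holder and the requester
- never changes the answer.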
- .lp
- Another table can be used to express lock convertibility.
- A lock conversion occurs when a client holds a lock in some mode and
- requests an operation that requires a different mode for the lock.
- Table \n($9.2 is a lock convertibility table for the six file lock modes.
- Each row indicates a lock that the client already holds and each column
- indicates the new lock mode requested. The entries represent the
- resulting lock mode obtained.
- .\" ) to match open paren Table ref above
- .(z
- .TS
- center, tab(#), box ;
- c s s s s s s
- c|c s s s s s
- c|c c c c c c
- l|l l l l l l.
-
- Lock#Lock Requested
- Held#NL#IS#IX#S#SIX#X
- _
- NL#NL#IS#IX#S#SIX#X
-
- IS#IS#IS#IX#S#SIX#X
-
- IX#IX#IX#IX#SIX#SIX#X
-
- S#S#S#SIX#S#SIX#X
-
- SIX#SIX#SIX#SIX#SIX#SIX#X
-
- X#X#X#X#X#X#X
- .TE
- .ce
- .uh "Table \n($9.2: Lock Convertibility"
- .\" ) to match open paren in .uh above
- .)z
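- .lp
- Lock conversion can be modeled the same way.
- The Python sketch below copies the rows of the convertibility table;
- the name \fCconvert\fR is hypothetical.

```python
MODES = ("NL", "IS", "IX", "S", "SIX", "X")

# One row per lock held; entries give the resulting mode for each
# requested mode, in MODES order.
_CONVERT = {
    "NL":  ("NL", "IS", "IX", "S", "SIX", "X"),
    "IS":  ("IS", "IS", "IX", "S", "SIX", "X"),
    "IX":  ("IX", "IX", "IX", "SIX", "SIX", "X"),
    "S":   ("S", "S", "SIX", "S", "SIX", "X"),
    "SIX": ("SIX", "SIX", "SIX", "SIX", "SIX", "X"),
    "X":   ("X", "X", "X", "X", "X", "X"),
}


def convert(held, requested):
    """Mode obtained when a client holding `held` requests `requested`."""
    return _CONVERT[held][MODES.index(requested)]


print(convert("IX", "S"))  # SIX
print(convert("IS", "X"))  # X
```

- .lp
- A request never weakens a lock: converting a mode to itself, or to a
- weaker mode such as NL, leaves the held mode unchanged.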
- .sp
- .br
- .sh 2 "Locks Obtained by Operations"
- .lp
- The locks mentioned above are obtained on two types of structures in
- the Storage Manager: files and pages.
- Only the pages that contain object headers and root
- entries are locked; large object data pages
- and file index pages are not locked.
- The entire root entry page is locked
- when a root entry is used.
- .lp
- Table \n($9.3 lists all of the locks
- .\" ) to match open paren above
- obtained by the various Storage Manager operations.
- The column labelled \*(lqFile Lock\*(rq
- indicates what lock mode is used for locking
- the file in question.
- The column labelled \*(lqPage Lock\*(rq
- indicates what lock mode is used for locking
- pages containing the objects or root entries in question.
- Locks are held until the end of the transaction in which they were
- acquired.
- .lp
- Some applications may find it necessary to acquire more
- restrictive locks on pages and files to avoid conflicts
- during lock-upgrade requests.
- For example, consider an application that reads an object (with
- sm_ReadObject(\ )) and subsequently writes it (with sm_WriteObject(\ )).
- When the object is read, a share lock is acquired for the object's page.
- .(x z
- lock, share
- .)x \*($n Appendix
- When the object is written, a lock-upgrade request is
- sent to the server to obtain an exclusive lock on the page.
- .(x z
- lock, exclusive
- .)x \*($n Appendix
- This extra message is relatively expensive and can lead to
- potential deadlock if other clients are locking the page as well.
- .(x z
- deadlock
- .)x \*($n Appendix
- To avoid this problem,
- the
- \*(lqpagelock\*(rq option can be used
- to change the default lock modes used when
- .(x z
- default lock mode
- .)x \*($n Appendix
- the client library locks a page.
- See Table 1 and the discussion of client options
- in
- Section 4.2, \fBInitialization and Shutdown Operations\fR
- for information about setting client options.
- See Appendix A for more information about lock modes and the
- Storage Manager's locking protocols.
- .(z
- .TS
- box, center, tab(#) ;
- c s s s
- c c c c
- l l l l.
-
- Operation#File Lock#Page Lock#Comments
- _
- sm_Initialize(\ )#-#-#no locks needed
- sm_ShutDown(\ )#-#-#no locks needed
- sm_OpenBufferGroup(\ )#-#-#no locks needed
- sm_CloseBufferGroup(\ )#-#-#no locks needed
-
- sm_SetRootEntry(\ )#-#X#root entry page
- sm_GetRootEntry(\ )#-#S#root entry page
- sm_RemoveRootEntry(\ )#-#X#root entry page
-
- sm_CreateFile(\ )#X#-#
- sm_DestroyFile(\ )#X#-#
-
- sm_GetFirstOid(\ )#S#-#
- sm_GetLastOid(\ )#S#-#
- sm_GetNextOid(\ )#S#-#
- sm_GetPreviousOid(\ )#S#-#
-
- sm_OpenScan(\ )#S#-#
- sm_OpenScanWithGroup(\ )#S#-#
- sm_ScanNextObject(\ )#-#-#no locks needed
- sm_CloseScan(\ )#-#-#no locks needed
-
- sm_OpenLoad(\ )#X#-#
- sm_LoadNextObject(\ )#-#-#no locks needed
- sm_CloseLoad(\ )#-#-#no locks needed
-
- sm_CreateObject(\ )#IX#X#unordered file
- sm_DestroyObject(\ )#IX#X#
- sm_ReadObject(\ )#IS#S#
- sm_ReadObjectHeader(\ )#IS#S#
- sm_ReleaseObject(\ )#-#-#no locks needed
- sm_WriteObject(\ )#IX#X#
- sm_InsertInObject(\ )#IX#X#
- sm_AppendToObject(\ )#IX#X#
- sm_DeleteFromObject(\ )#IX#X#
-
- sm_CreateVersion(\ )#IX#X#
- sm_FreezeVersion(\ )#IX#X#
- .TE
- .ce
- .uh "Table \n($9.3: Locks Obtained by Operations"
- .\" ) to match open paren above
- .)z
- .(x z
- locks obtained by functions
- .)x \*($n Table
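- .lp
- The read-then-write case discussed above can be made concrete with a
- small Python model that counts lock-upgrade requests on a single page.
- The page lock modes are copied from the table;
- the function \fCcount_upgrades\fR and its \fCread_mode\fR parameter are
- hypothetical, and \fCread_mode\fR only imitates the effect of choosing a
- more restrictive default page lock.
- It does not reproduce the syntax of the \*(lqpagelock\*(rq option.

```python
# Page lock modes for a few operations, copied from the table above.
PAGE_LOCK = {
    "sm_ReadObject": "S",
    "sm_ReadObjectHeader": "S",
    "sm_WriteObject": "X",
    "sm_DestroyObject": "X",
}


def count_upgrades(operations, read_mode="S"):
    """Count lock-upgrade requests for operations on one page.

    read_mode is the mode actually taken when an operation asks for a
    share lock; passing "X" models locking exclusively from the start.
    """
    held = None
    upgrades = 0
    for op in operations:
        wanted = PAGE_LOCK[op]
        if wanted == "S":
            wanted = read_mode
        if held is None:
            held = wanted        # first lock request, not an upgrade
        elif held == "S" and wanted == "X":
            held = "X"
            upgrades += 1        # extra message to the server
    return upgrades


ops = ["sm_ReadObject", "sm_WriteObject"]
print(count_upgrades(ops))                 # 1
print(count_upgrades(ops, read_mode="X"))  # 0
```

- .lp
- The cost of avoiding the upgrade is reduced concurrency, because
- an exclusive lock blocks other readers of the page.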
- .sp
- .br
- .sh 2 "Deadlock Detection and Avoidance"
- .lp
- With each lock request, a server analyzes its
- local waits-for graph and detects local cycles, or \*(lqlocal deadlocks\*(rq.
- .(x z
- deadlock avoidance
- .)x \*($n Appendix
- .(x z
- deadlock detection
- .)x \*($n Appendix
- The request that would cause a deadlock is denied (returns
- esmFAILURE),
- and the client library returns esmLOCKCAUSEDDEADLOCK
- to the application in the global variable sm_errno.
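- .lp
- Local deadlock detection amounts to checking whether a new wait would
- close a cycle in the waits-for graph.
- The Python sketch below shows one way to do this;
- the graph representation and the function names are assumptions,
- and the sketch does not claim to match the server's data structures.

```python
def would_deadlock(waits_for, waiter, holders):
    """True if making `waiter` wait for `holders` would close a cycle.

    waits_for maps each transaction to the set of transactions it is
    currently waiting for.
    """
    stack = list(holders)
    seen = set()
    while stack:
        txn = stack.pop()
        if txn == waiter:
            return True          # a path leads back to the waiter
        if txn in seen:
            continue
        seen.add(txn)
        stack.extend(waits_for.get(txn, ()))
    return False


def request_lock(waits_for, waiter, holders):
    """Record the wait, or deny the request if it would deadlock."""
    if would_deadlock(waits_for, waiter, holders):
        return "esmLOCKCAUSEDDEADLOCK"
    waits_for.setdefault(waiter, set()).update(holders)
    return "waiting"


graph = {}
print(request_lock(graph, "T1", {"T2"}))  # waiting
print(request_lock(graph, "T2", {"T3"}))  # waiting
print(request_lock(graph, "T3", {"T1"}))  # esmLOCKCAUSEDDEADLOCK
```

- .lp
- The denied request is never added to the graph, so the graph
- stays free of cycles.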
- .lp
- Distributed transactions may also cause a deadlock.
- The servers do not detect deadlocks that involve other servers.
- Global deadlocks are avoided by timing out locks.
- Each request that awaits a lock is aged.
- When its age exceeds the time given by the client's \*(lqlocktimeout\*(rq
- option, the request is denied (returns esmFAILURE),
- and the client library returns esmLOCKBUSY
- to the application in the global variable sm_errno.
- .lp
- When an application's request fails with esmLOCKBUSY or esmLOCKCAUSEDDEADLOCK,
- the application must abort its transaction, to free the locks
- it holds, and it must start its transaction again.
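- .lp
- An application-level retry loop might look like the following
- Python sketch.
- The exception class, the callback parameters, and the limit on the
- number of attempts are all assumptions of the example;
- only the two error codes come from this manual.

```python
class LockFailure(Exception):
    """Hypothetical stand-in for a request that failed on a lock."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


RETRYABLE = {"esmLOCKBUSY", "esmLOCKCAUSEDDEADLOCK"}


def run_transaction(body, begin, commit, abort, max_attempts=5):
    """Run body() in a transaction, restarting it after lock failures."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        begin()
        try:
            result = body()
        except LockFailure as failure:
            abort()  # frees the locks held by this transaction
            if failure.code not in RETRYABLE or attempt == max_attempts:
                raise
            continue  # start the transaction again
        commit()
        return result


events = []
calls = {"n": 0}


def body():
    calls["n"] += 1
    if calls["n"] < 3:
        raise LockFailure("esmLOCKBUSY")
    return "done"


outcome = run_transaction(
    body,
    begin=lambda: events.append("begin"),
    commit=lambda: events.append("commit"),
    abort=lambda: events.append("abort"),
)
print(outcome)  # done
print(events)   # ['begin', 'abort', 'begin', 'abort', 'begin', 'commit']
```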
- .sp 3
- .bp
- .sh 1 "APPENDIX : Generation of Unique Numbers for OIDs"
- .lp
- The \*(lqunique\*(rq field of an OID is a special 32-bit value that is generated
- when the object is created and used to detect instances where the OID
- has become dangling or corrupted.
- The values that are stored in \*(lqunique\*(rq fields are generated by
- Storage Manager servers.
- Disk volumes are partitioned into blocks of 32
- pages, and for each partition a 32-bit counter is maintained.
- When a new page is allocated, it is allotted a range (100) of unique
- numbers to use during object creation.
- The counter in the partition containing the new page is incremented to reflect the allotment.
- When this allotment has been exhausted,
- a request is made to the server for another allotment.
- When an object is created in a particular
- partition, the \*(lqunique\*(rq field of the new object's OID is set to the
- next available number in the range on the page.
- While this strategy does not guarantee that OIDs
- are unique for all time, the
- probability of a dangling OID that
- maps to the same page and the same slot,
- and has the same \*(lqunique\*(rq field as a valid OID is very low.
- As a result, \*(lqunique\*(rq fields can be used to virtually guarantee
- the validity of an OID.
- We adopted this approach instead of using
- unique-for-all-time logical OIDs with a surrogate
- index in order to avoid the extra disk I/Os that might be
- needed to translate a logical OID to a physical address.
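- .lp
- The allotment scheme can be sketched in Python as follows.
- The partition size (32 pages) and the allotment size (100) come from
- the description above;
- the class names, the starting counter value of 0, and the wrap-around
- at 32 bits are assumptions made for the example.

```python
PAGES_PER_PARTITION = 32  # from the text
ALLOTMENT = 100           # from the text
MASK = 0xFFFFFFFF         # counters are 32 bits wide


class UniqueServer:
    """Keeps one counter per partition of 32 pages."""

    def __init__(self):
        self.counters = {}  # partition number -> next unallotted value

    def allot(self, page_number):
        partition = page_number // PAGES_PER_PARTITION
        start = self.counters.get(partition, 0)
        self.counters[partition] = (start + ALLOTMENT) & MASK
        return start


class Page:
    """Hands out unique numbers from its current allotment."""

    def __init__(self, number, server):
        self.number = number
        self.server = server
        self.next = server.allot(number)
        self.remaining = ALLOTMENT

    def new_unique(self):
        if self.remaining == 0:
            # Allotment exhausted: ask the server for another one.
            self.next = self.server.allot(self.number)
            self.remaining = ALLOTMENT
        value = self.next
        self.next = (self.next + 1) & MASK
        self.remaining -= 1
        return value


server = UniqueServer()
page0 = Page(0, server)   # partition 0, allotted 0..99
page5 = Page(5, server)   # partition 0, allotted 100..199
values = [page0.new_unique() for _ in range(101)]
print(values[0], values[99], values[100])  # 0 99 200
print(page5.new_unique())                  # 100
```

- .lp
- Two pages in the same partition never receive overlapping ranges
- until the 32-bit counter wraps, which is why a stale OID is very
- unlikely to match a valid one.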
- .bp
- .++P
- .sp 0.5i
- .ce 1
- \fBTABLE OF CONTENTS\fR
- .sp 2
- .xp
- .\" .bp
- .\" .++P
- .\" .sp 0.5i
- .\" .ce 1
- .\" \fBINDEX\fR
- .\" .sp 2
- .\" .xp z
-